megacamelus / camel-assistant

Apache License 2.0

Investigate chunking strategies #74

Open orpiske opened 1 month ago

orpiske commented 1 month ago

We need to investigate chunking strategies that can help the assistant provide better answers:

lburgazzoli commented 1 month ago

IMHO, this work should end up being part of langchain4j, and we can eventually use it as one of the tokenizer strategies in Apache Camel.

oscerd commented 1 month ago

I don't think it's something that should go in Camel. Camel is an integration framework; tokenizing is a feature that belongs somewhere else.

orpiske commented 1 month ago

> IMHO, this work should end up being part of langchain4j, and we can eventually use it as one of the tokenizer strategies in Apache Camel.

Yeah.

I also don't see it as being part of Camel, as rightly pointed out by @oscerd. Camel could still use it, though.

So, I think a reasonable approach would be to create a Java library and then work to include support for it in langchain4j and Quarkus.

lburgazzoli commented 1 month ago

I would then move this discussion to the langchain4j issue tracker, so they can provide additional info or suggestions, as they may already have had the chance to think about it.

orpiske commented 1 month ago

For reference, here's a discussion with the Langchain4j project. Their suggestion is to look at the DocumentSplitter interface and work on top of that.

https://github.com/langchain4j/langchain4j/issues/1081
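
To make the discussion concrete, here is a minimal sketch of one possible chunking strategy: a fixed-size character window with overlap between consecutive chunks. It is only an illustration, not what the thread settled on. The `Chunker` interface and the `SlidingWindowChunker` class are hypothetical names invented for this sketch and deliberately avoid any langchain4j or Camel types. Adapting it to the `DocumentSplitter` interface mentioned above would need that interface's actual method signatures, which are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction for this sketch; not a langchain4j or Camel type.
interface Chunker {
    List<String> split(String text);
}

// Splits text into windows of at most maxChars characters, where each window
// repeats the last `overlap` characters of the previous one.
final class SlidingWindowChunker implements Chunker {
    private final int maxChars;
    private final int overlap;

    SlidingWindowChunker(int maxChars, int overlap) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        if (overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("overlap must be in [0, maxChars)");
        }
        this.maxChars = maxChars;
        this.overlap = overlap;
    }

    @Override
    public List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return chunks;
        }
        // Each new window starts (maxChars - overlap) characters after the previous one.
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }
}
```

For example, `new SlidingWindowChunker(4, 1).split("abcdefghij")` yields the chunks `abcd`, `defg`, and `ghij`. A character window is the simplest baseline; strategies that break on sentence or paragraph boundaries would plug into the same hypothetical interface.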