Open gordoneliel opened 6 months ago
After we've split the text, we reduce over the produced chunks. Inside the reduce, we figure out the start and end bytes and put that information on the chunk in question.
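For anyone following along, here is a rough sketch of what that kind of reduce could look like. It is not the library's actual code: `ByteOffsetSketch`, `annotate/2`, and the plain-map chunk shape are made up for illustration. It assumes each piece is non-empty, appears verbatim in the text, and starts strictly after the previous one, and it treats `end_byte` as exclusive.

```elixir
defmodule ByteOffsetSketch do
  # Hypothetical helper, not TextChunker's implementation.
  # `pieces` are the split strings, in document order.
  def annotate(text, pieces) do
    {chunks, _cursor} =
      Enum.reduce(pieces, {[], 0}, fn piece, {acc, cursor} ->
        # Only search from the cursor onward, so overlapping chunks are
        # still found even though they share bytes with the previous one.
        scope = {cursor, byte_size(text) - cursor}
        {start_byte, len} = :binary.match(text, piece, scope: scope)

        # end_byte is exclusive here (an assumption made for this sketch).
        chunk = %{text: piece, start_byte: start_byte, end_byte: start_byte + len}

        # Assumes the next chunk starts at least one byte later.
        {[chunk | acc], start_byte + 1}
      end)

    Enum.reverse(chunks)
  end
end
```

Heavily repeated text can fool the forward search, so a real implementation would more likely track offsets while splitting rather than searching afterwards.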
I can absolutely see the value in a 'metadata', 'label', 'title' or some other 'custom properties' field! Especially if the chunker is being used in a RAG flow. @estreeper @gk-per what do you think?
Hey, @gordoneliel, can you give us a better idea of the problem you're looking to solve here so we can understand how it fits into the feature set and interface of the library? Maybe some sample code of what you're hoping to achieve?
So you know where I'm coming from: In the interest of keeping the library's API and feature set as narrow as possible, I'm inclined toward patterns like this rather than trying to couple the library's output too closely to a specific app's needs:
document.text
|> TextChunker.split()
|> Enum.map(fn chunk ->
  %OurOwnChunkStruct{
    document_id: document.id,
    text: chunk.text,
    start_byte: chunk.start_byte,
    # ...
  }
end)
But I'm very interested in how that may or may not fit your use case and totally open to expanding the functionality if it'll move the needle for you!
This would also be useful for storing, say, the page number of a chunked PDF.
Here's a pretty interesting project where I suggested chunking by content instead of by page (but where the page number would still need to be tracked): https://github.com/toranb/rag-n-drop/issues/1
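To make the page-number case concrete, here is one way it might be handled outside the library, assuming the pages are joined into a single string before splitting and each chunk exposes a `start_byte`. `PageLookupSketch`, `page_starts/2`, and `page_for/2` are hypothetical names, and the `"\n"` separator is an arbitrary choice.

```elixir
defmodule PageLookupSketch do
  # Hypothetical helper: returns [{page_number, first_byte}] for pages that
  # will be joined with `separator` into one document string.
  def page_starts(pages, separator \\ "\n") do
    {starts, _total} =
      pages
      |> Enum.with_index(1)
      |> Enum.map_reduce(0, fn {page, number}, offset ->
        {{number, offset}, offset + byte_size(page) + byte_size(separator)}
      end)

    starts
  end

  # Page containing a non-negative byte offset: the last page whose first
  # byte is at or before that offset.
  def page_for(starts, byte_offset) do
    starts
    |> Enum.take_while(fn {_number, first_byte} -> first_byte <= byte_offset end)
    |> List.last()
    |> elem(0)
  end
end
```

With `starts = PageLookupSketch.page_starts(pages)` and the document built via `Enum.join(pages, "\n")`, each chunk's page could then be looked up with `PageLookupSketch.page_for(starts, chunk.start_byte)`. A chunk that spans a page break is attributed to the page where it begins.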
When splitting up documents, it's helpful to add, for example, the title of the document that a chunk was extracted from.
The splitter would take in some optional meta that would be passed down:
opts = [chunk_size: 10, chunk_overlap: 5, metadata: %{doc_title: "My doc"}]
chunks = TextChunker.split(text, opts)
Then each chunk would inherit the metadata.
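As a rough illustration of the inheritance I have in mind, here is a wrapper that does it outside the library. The `:metadata` option does not exist in TextChunker today; `MetadataSketch` and the `%{chunk: ..., metadata: ...}` shape are made up for this sketch, and the splitter is passed in as a function so the example does not depend on any particular library version.

```elixir
defmodule MetadataSketch do
  # Hypothetical wrapper: `splitter` is any 2-arity function taking
  # (text, opts) and returning a list of chunks.
  def split(text, opts, splitter) do
    # Pull :metadata out so the underlying splitter never sees an option
    # it does not know about.
    {metadata, splitter_opts} = Keyword.pop(opts, :metadata, %{})

    text
    |> splitter.(splitter_opts)
    # Wrap rather than modify the chunk, since chunks may be structs that
    # do not define a metadata field.
    |> Enum.map(fn chunk -> %{chunk: chunk, metadata: metadata} end)
  end
end
```

Called as `MetadataSketch.split(text, opts, &TextChunker.split/2)`, every returned entry would carry the same `%{doc_title: "My doc"}` map alongside its chunk.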
Another option would be to add a title/label prop instead of metadata.
What are your thoughts on this? How are you currently adding info to the chunks you're splitting?