revelrylabs / text_chunker_ex

A library for semantically coherent text chunking
MIT License

Add metadata to chunk #14

Open gordoneliel opened 6 months ago

gordoneliel commented 6 months ago

When splitting up documents, it's helpful to attach some context to each chunk, for example the title of the document it was extracted from.

The splitter would take in some optional metadata that would be passed down:

opts = [chunk_size: 10, chunk_overlap: 5, metadata: %{doc_title: "My doc"}]
chunks = TextChunker.split(text, opts)

Then each chunk would inherit the metadata.

%TextChunk{
  ...existing_props,
  metadata: %{doc_title: "My doc"}
}

Another option would be to add a title/label prop instead of metadata.

What are your thoughts on this? How are you currently adding info to the chunks you're splitting?

stuartjohnpage commented 5 months ago

After we've split the text, we reduce over the produced chunks. Inside the reduce, we figure out the start and end bytes, and put that information on the chunk in question.
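For illustration, a reduce of that kind could look roughly like the sketch below. It does not call the library: it assumes the chunk texts are already available as plain strings in document order and that each one appears verbatim in the source text. The `ByteOffsets` module and `annotate/2` are hypothetical names, and treating `end_byte` as exclusive is an assumption of this sketch.

```elixir
defmodule ByteOffsets do
  # Hypothetical helper, not part of text_chunker_ex.
  # Assumes `chunk_texts` are in document order and each occurs verbatim in `text`.
  def annotate(text, chunk_texts) do
    {chunks, _search_from} =
      Enum.reduce(chunk_texts, {[], 0}, fn chunk_text, {acc, search_from} ->
        # Search from the previous chunk's start rather than its end, so that
        # overlapping chunks are still found; matches before that point are ignored.
        scope = {search_from, byte_size(text) - search_from}
        {start_byte, length} = :binary.match(text, chunk_text, scope: scope)

        # end_byte is exclusive here (an assumption of this sketch).
        chunk = %{text: chunk_text, start_byte: start_byte, end_byte: start_byte + length}
        {[chunk | acc], start_byte}
      end)

    Enum.reverse(chunks)
  end
end
```

The accumulator carries both the chunks built so far and the byte position to search from, which is why a reduce fits better than a plain map.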

I can absolutely see the value in a 'metadata', 'label', 'title' or some other 'custom properties' field! Especially if the chunker is being used in a RAG flow. @estreeper @gk-per what do you think?

gk-per commented 5 months ago

grossvogel commented 5 months ago

Hey, @gordoneliel, can you give us a better idea of the problem you're looking to solve here so we can understand how it fits into the feature set and interface of the library? Maybe some sample code of what you're hoping to achieve?

So you know where I'm coming from: In the interest of keeping the library's API and feature set as narrow as possible, I'm inclined toward patterns like this rather than trying to couple the library's output too closely to a specific app's needs:

document.text
|> TextChunker.split()
|> Enum.map(fn chunk ->
  %OurOwnChunkStruct{
    document_id: document.id,
    text: chunk.text,
    start_byte: chunk.start_byte,
    #...
  }
end)

But I'm very interested in how that may or may not fit your use case and totally open to expanding the functionality if it'll move the needle for you!

cpursley commented 4 months ago

This would also be useful for storing, say, the page number of a chunked PDF.

Here's a pretty interesting project where I suggested chunking by content instead of by page (but where the page number would still need to be tracked): https://github.com/toranb/rag-n-drop/issues/1
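
As a rough sketch of the page-number case, the wrapper below tags each chunk with shared metadata plus the page it came from, without any change to the library. `PageChunks`, `split_with_metadata/3` and the `{page_number, page_text}` input shape are hypothetical, and the splitter is passed in as a function so the example stays self-contained.

```elixir
defmodule PageChunks do
  # Hypothetical wrapper, not part of text_chunker_ex.
  # `pages` is a list of {page_number, page_text} tuples; `split_fun` is any
  # one-argument function that returns a list of chunks for a string.
  def split_with_metadata(pages, metadata, split_fun) do
    Enum.flat_map(pages, fn {page_number, page_text} ->
      page_text
      |> split_fun.()
      |> Enum.map(fn chunk ->
        # Every chunk carries the shared metadata plus its source page.
        %{chunk: chunk, metadata: Map.put(metadata, :page, page_number)}
      end)
    end)
  end
end
```

Passing `&TextChunker.split/1` as `split_fun` would apply the same idea to real chunks. Note that this splits page by page; chunking by content across page boundaries would instead need a mapping from byte offsets back to pages.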