pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com

How to retrieve the whole document for a chunk? #69

Open istvan-deak opened 2 months ago

istvan-deak commented 2 months ago

What is your question or problem? Please describe.

I would like to use the long context window of the LLM of my choice and pass whole files to the prompt.

Describe what you would like to happen

During retrieval, I'd like the system to:

  1. First fetch the small chunks as it currently does
  2. Then look up the parent IDs for those chunks
  3. Return the larger documents or even the whole file associated with those parent IDs

This approach would allow for more context to be provided to the LLM, potentially improving its performance on tasks that require broader context.
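For illustration, the three steps above could look roughly like this in plain Python. This is only a sketch, not Pathway's API: the `Chunk` class, the `parent_id` field, the `retrieve_parents` function, and the toy word-overlap score (standing in for real vector search) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    parent_id: str  # hypothetical field linking a chunk to its source document


def retrieve_parents(query, chunks, documents, k=3):
    """Return the distinct parent documents of the k best-matching chunks."""
    query_words = set(query.lower().split())

    # Step 1: fetch small chunks (a toy word-overlap score stands in for vector search).
    def score(chunk):
        return len(query_words & set(chunk.text.lower().split()))

    top_chunks = sorted(chunks, key=score, reverse=True)[:k]

    # Step 2: look up parent IDs, keeping rank order and dropping duplicates.
    parent_ids = list(dict.fromkeys(c.parent_id for c in top_chunks))

    # Step 3: return the whole documents associated with those parent IDs.
    return [documents[pid] for pid in parent_ids]


documents = {
    "a.txt": "Pathway is a Python framework. It handles streaming data.",
    "b.txt": "Bananas are yellow. Apples are red.",
}
chunks = [
    Chunk("Pathway is a Python framework.", "a.txt"),
    Chunk("It handles streaming data.", "a.txt"),
    Chunk("Bananas are yellow.", "b.txt"),
    Chunk("Apples are red.", "b.txt"),
]
result = retrieve_parents("streaming data", chunks, documents, k=2)
```

Deduplicating the parent IDs matters here: several top chunks often come from the same file, and the file should be passed to the prompt only once.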

szymondudycz commented 2 months ago

If you want to index whole files, don't use a splitter, and make sure the parser doesn't split documents (e.g. use 'mode=single' in ParseUnstructured).

Doing exactly what you want, that is, indexing small chunks but retrieving whole documents, is not easily supported. What you can do is write your own splitter that inserts the full document text into the metadata of each chunk. Then, after the chunks are retrieved, use the full document text from the metadata rather than the returned chunk text.
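A minimal sketch of that workaround in plain Python. It does not use Pathway's splitter interface; the function names, the `(text, metadata)` tuple shape, and the `full_text` metadata key are assumptions made for illustration.

```python
def split_with_full_text(text, metadata=None, chunk_size=40):
    """Split text into fixed-size chunks; each chunk's metadata carries the full text."""
    base = dict(metadata or {})
    base["full_text"] = text  # hypothetical metadata key holding the whole document
    return [
        (text[start:start + chunk_size], dict(base))
        for start in range(0, len(text), chunk_size)
    ]


def full_documents(retrieved):
    """Ignore the chunk text and return each distinct full document once."""
    seen = []
    for _chunk_text, meta in retrieved:
        if meta["full_text"] not in seen:
            seen.append(meta["full_text"])
    return seen


doc = "Pathway processes streams. " * 4
chunks = split_with_full_text(doc, {"path": "notes.txt"})

# Pretend the first two chunks were the ones returned by retrieval.
docs = full_documents(chunks[:2])
```

The trade-off is that the full text is duplicated in every chunk, so the stored metadata grows with the number of chunks per document.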

dxtrous commented 2 months ago

@szymondudycz I believe this question has come up a number of times already. Perhaps we should make it into a feature request? The resolution could be e.g. a code template that shows how to have a table of full_document_metadata, a table of chunks with document_id in their metadata, and shows how to retrieve full_document_metadata for a given chunk, and maybe also load/reread the document on demand (with a udf). @istvan-deak if you have any thoughts here, please don't hesitate to share.
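As a starting point for such a template, here is a rough plain-Python sketch of the data layout: a table of document metadata keyed by ID, chunks that carry a `document_id` in their metadata, and an on-demand reload of the file. It uses dictionaries rather than Pathway tables, and the helper names and the `path` field are hypothetical; `load_document` only stands in for what a UDF might do.

```python
import tempfile
from pathlib import Path


def load_document(document_id, document_table):
    """Reread the file on demand, using the path stored in the metadata table."""
    return Path(document_table[document_id]["path"]).read_text(encoding="utf-8")


def documents_for_chunks(retrieved_chunks, document_table):
    """Map each distinct document_id found in the chunks to its full text."""
    ids = dict.fromkeys(c["metadata"]["document_id"] for c in retrieved_chunks)
    return {doc_id: load_document(doc_id, document_table) for doc_id in ids}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "report.txt"
    path.write_text("Intro. Details. Conclusion.", encoding="utf-8")

    # Table of full-document metadata, keyed by a hypothetical document ID.
    full_document_metadata = {"doc-1": {"path": str(path)}}

    # Chunks keep only a reference to their parent document.
    chunks = [
        {"text": "Intro.", "metadata": {"document_id": "doc-1"}},
        {"text": "Details.", "metadata": {"document_id": "doc-1"}},
    ]
    loaded = documents_for_chunks(chunks, full_document_metadata)
```

Compared with storing the full text in every chunk, this keeps chunk metadata small, at the cost of one extra file read per retrieved document.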