[Question] How to store references to pdf pages in chunks?

microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.

https://microsoft.github.io/kernel-memory

MIT License


[Question] How to store references to pdf pages in chunks? #449

Closed clarity99 closed 7 months ago

clarity99 commented 7 months ago

Context / Scenario

I want to be able to reference the pages that chunks come from. Basically, I'd like results similar to chatpdf.com, where the LLM answer includes references to the pages the response comes from.

Question

When searching, I'd like to be able to reference the pages the answers come from. From my cursory inspection, it seems I'd have to override the chunking algorithm so that each chunk equals one page of the PDF? Or is there another way?

dluc commented 7 months ago

We just merged https://github.com/microsoft/kernel-memory/pull/415, which adds download URLs to search/ask results. I believe that's what you're looking for.

If you mean the exact page, e.g. page 1, page 5 - currently that works only for PDF, Excel and PowerPoint IIRC.

Web pages and text/markdown files don't support pagination.
Word docs are processed using OpenXML, and pagination is not reliable (see https://stackoverflow.com/questions/39992870/how-to-access-openxml-content-by-page-number). To get reliable pagination for Word docs, one would have to convert them to PDF first.

clarity99 commented 7 months ago

Thank you, but if I understand correctly, this does not address my question. I don't want to reference the file (I already get that data back when I search), but the actual page in the PDF that the chunk came from. When you say that this currently only works for PDF, Excel and PowerPoint, what do you mean? Does KernelMemory already support this in some way? Or do you mean that it's possible to get pages for these formats but not for Word and others?

dluc commented 7 months ago

The feature is not ready yet, because the text chunker doesn't support chunk metadata (e.g. the page number) yet. So the answer is that KM currently doesn't provide the page number. The page number is extracted, though; it's just lost during the ingestion pipeline. You can find references in the code to "SectionNumber", which is the page number for PDF, the slide number for PowerPoint, and so on. The value is always 1 for Word documents, because OpenXML doesn't support exact pagination.

When using the Search and Ask methods, you'll see that there's a SectionNumber value in the result. Currently the value is always 1, for any doc type, because of the work left to do in the text chunker.

clarity99 commented 7 months ago

Thank you! Any idea of the timeline for when this will be implemented?

dluc commented 7 months ago

At this point, there is no set timeline for the implementation of this feature, and it is not currently planned. However, we are open to contributions from the community. A good starting point would be to write a new text chunker capable of retaining metadata.
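To make "a text chunker capable of retaining metadata" more concrete, here is a minimal sketch of the idea in Python. It is not Kernel Memory code and does not use its API: `Section`, `Chunk`, and `chunk_sections` are hypothetical names, and splitting by word count is an assumption standing in for a real token-based splitter. The only point it illustrates is that if chunks never cross a section boundary, each chunk can carry the section number (the page for PDF, the slide for PowerPoint) it came from.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical unit produced by a decoder: one page, slide, or sheet."""
    number: int
    text: str


@dataclass
class Chunk:
    """Hypothetical chunk that keeps the number of the section it came from."""
    text: str
    section_number: int


def chunk_sections(sections, max_words=200):
    """Split each section into chunks of at most max_words words.

    Chunks never span two sections, so every chunk maps to exactly
    one section number.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks = []
    for section in sections:
        words = section.text.split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            chunks.append(Chunk(text=piece, section_number=section.number))
    return chunks


if __name__ == "__main__":
    pages = [Section(1, "alpha beta gamma delta epsilon"), Section(2, "zeta eta")]
    for chunk in chunk_sections(pages, max_words=3):
        print(chunk.section_number, chunk.text)
```

The trade-off in this sketch is that text flowing across a page break ends up in two separate chunks. A chunker that allows overlap between pages would instead need to store a range of section numbers per chunk.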

clarity99 commented 6 months ago

Thank you, I'll take a look and see. Is there some documentation on how the pipeline works? From the little exploration I did, it seems the text decoder saves the text to a plain file, and then the text chunker no longer has any way of getting the pages. Is there some metadata that gets passed between stages, or is it all done through files?
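For illustration only, one way page information could survive a file-based handoff between stages is for the extraction step to write a structured file instead of a single plain-text blob. The sketch below is an assumption about a possible design, not a description of how Kernel Memory stores intermediate files; `save_sections`, `load_sections`, and the JSON layout are all hypothetical.

```python
import json
import os
import tempfile


def save_sections(path, sections):
    """Write (section_number, text) pairs as JSON rather than flat text."""
    payload = {"sections": [{"number": n, "text": t} for n, t in sections]}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False)


def load_sections(path):
    """Read the pairs back, so a later stage still knows each page number."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return [(s["number"], s["text"]) for s in payload["sections"]]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "extracted.json")
        save_sections(path, [(1, "first page"), (2, "second page")])
        print(load_sections(path))
```

With a handoff like this, a chunking stage could read the sections back and tag each chunk with its page number, instead of receiving text with the page boundaries already erased.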