pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Chunking of text files #52

Closed zymbuzz closed 3 months ago

zymbuzz commented 3 months ago

Hello, and thanks a lot for this encouraging project.

I want to bring to your attention an issue that arises when importing text files (attached example: myselectdoc.txt).

When applying the to_markdown function, the text is chunked into pieces with the separator "-----". Is it possible to disable chunking?

When I specify the argument margins=[0, 0, 0, 0], text is no longer lost, but the chunking still happens.
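As a stopgap for the separator described above, the Markdown string can be cleaned up after conversion. The following is a minimal sketch, not part of the library: `strip_page_separators` is a hypothetical helper, and it assumes the separator always sits on a line of its own as exactly "-----". A genuine horizontal rule written the same way would be removed too.

```python
import re


def strip_page_separators(md_text: str, separator: str = "-----") -> str:
    """Remove lines that consist only of the page separator.

    Hypothetical post-processing helper; assumes the separator
    occupies a whole line, as in the output described above.
    """
    # Keep every line except those that are exactly the separator.
    lines = [ln for ln in md_text.splitlines() if ln.strip() != separator]
    joined = "\n".join(lines)
    # Removing a separator leaves extra blank lines; collapse them.
    return re.sub(r"\n{3,}", "\n\n", joined).strip() + "\n"


if __name__ == "__main__":
    sample = "first page text\n\n-----\n\nsecond page text\n\n-----\n\n"
    print(strip_page_separators(sample))
```

This only hides the symptom: a paragraph that was cut in two at a page boundary stays split into two paragraphs.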

The issue may be irrelevant to your project goal, but it might be interesting because you allow for non-pdf files.

JorjMcKie commented 3 months ago

Thank you for your interest and appreciation! No issue is irrelevant 😉.

A few words on the specifics of some of the many supported document types: There are documents with a fixed page layout, like PDF or XPS. And then there are others with a variable layout, which we call "reflowable". Examples of reflowable types are e-books (EPUB, MOBI), but also HTML, TEXT and Office documents. For these types, PyMuPDF is forced to start with standard assumptions about the page size (400 x 600) because no recognizable page breaks or page widths exist. However, PyMuPDF can change these standard assumptions after opening the document via its method Document.layout().

We are planning to offer additional arguments for the to_markdown method, such as page_width, together with the option to assume that the whole reflowable document consists of just one page with that width, perhaps expressed as page_height=infinite in some suitable form. With this, you would never again see those "-----" page separators.

JorjMcKie commented 3 months ago

Solved in version 0.0.6.