Closed SamGalanakis closed 6 months ago
As this is a fairly new feature, we are constantly looking into ways for improving it.
You are quite right: creating one giant Markdown string for the whole document is not always the most convenient result, e.g. when it comes to chunking requirements.
We are accepting specialized methods for major LLM / RAG implementors which are based on `to_markdown`.
What we envisage is segmenting the total Markdown text into a list of single-page outputs, where each page item is a dictionary containing the text itself plus some metadata (document info, page number, ...).
This approach also lends itself more easily to parallelization for speeding things up: the Markdown results of multiple independent processes can be collected to form the final list ...
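A minimal sketch of the collection step described above. `render_page` is a hypothetical stand-in for whatever converts one page to Markdown; the actual pymupdf4llm call is not shown here, and a real workload would likely use processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def render_page(page_number: int) -> dict:
    # Hypothetical placeholder for a real single-page conversion.
    return {"page": page_number, "text": f"# Page {page_number}\n..."}


page_numbers = range(4)
with ThreadPoolExecutor() as pool:
    # map() preserves input order, so the collected list is already
    # sorted by page number even though pages render independently.
    results = list(pool.map(render_page, page_numbers))
```

Because each page is rendered independently, the workers need no shared state and the final list is simply the concatenation of their outputs.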
You could help by providing information on which data should be part of the page output.
Chunking at page level is, however, only one of several alternatives. Other approaches may want this to happen at text header levels (e.g. starting a new chunk at every level-1 header). This could of course also be done, but would not be parallelizable by its very nature, and would also be less reliable because of source document quality issues ... just to mention two of several concerns here.
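To illustrate the header-level alternative, here is a minimal sketch that starts a new chunk at every level-1 ATX header (`# ...`). It walks the full Markdown string line by line, which is why this variant is inherently sequential, unlike per-page chunking.

```python
# Sample Markdown text standing in for a to_markdown() result.
md = "# Intro\ntext a\n# Methods\ntext b\ntext c\n# Results\ntext d\n"

chunks, current = [], []
for line in md.splitlines():
    # A level-1 header closes the previous chunk (if any) and opens a new one.
    if line.startswith("# ") and current:
        chunks.append("\n".join(current))
        current = []
    current.append(line)
if current:
    chunks.append("\n".join(current))
# chunks now holds one string per level-1 section
```

Note that this relies on headers actually being detected in the source PDF; poor document quality can merge or split sections in unexpected ways.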
The latest version 0.0.2 has the option to generate per-page results:
```python
import pymupdf4llm

data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
# Each item in the list 'data' is a dictionary with keys
# "metadata" and "text".
# "metadata" is again a dictionary: the doc.metadata content,
# enriched by the file path and the page number.
```
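A sketch of consuming that per-page output. The hand-built `data` list below only mimics the structure described in the comments above; the exact metadata key names may differ between versions, so treat them as assumptions.

```python
# Stand-in for pymupdf4llm.to_markdown("input.pdf", page_chunks=True);
# key names ("file_path", "page") are assumed, not taken from the library.
data = [
    {"metadata": {"file_path": "input.pdf", "page": 1}, "text": "# Page one\n..."},
    {"metadata": {"file_path": "input.pdf", "page": 2}, "text": "# Page two\n..."},
]

# Each chunk carries its own provenance, so downstream RAG code can
# attach a source page to every piece of text it indexes.
for item in data:
    meta = item["metadata"]
    print(f"{meta['file_path']} p.{meta['page']}: {len(item['text'])} chars")
```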
It would be nice to keep track of the line index in the Markdown at which each page break of the original PDF occurs, so it can be used down the line for, say, citation in RAG. I can look into implementing this if it makes sense.
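One possible sketch of that idea: when the per-page texts are joined into a single Markdown document, record the line index at which each page starts, so a citation can map any line of the combined text back to its source page. The page texts here are dummies.

```python
# Dummy per-page Markdown texts standing in for page-chunk results.
pages = ["line a\nline b", "line c", "line d\nline e\nline f"]

page_start_lines = []
line_count = 0
for text in pages:
    page_start_lines.append(line_count)
    # Number of lines this page contributes to the joined document.
    line_count += text.count("\n") + 1

full_md = "\n".join(pages)
# page_start_lines → [0, 2, 3]: page i begins at that line index of full_md
```

With this mapping, a retrieved chunk's line number can be binary-searched against `page_start_lines` to recover its page for citation.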