pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Handling Graphical Images & Superscripts #116

Open SBhat2615 opened 2 months ago

SBhat2615 commented 2 months ago
  1. Embedded images are extracted to a dedicated folder; I observed this for some documents. However, some graphical images in the PDF below are not extracted to a separate folder.
  2. There are also superscripts in the PDF, which are not referenced.

sample_document.pdf

JorjMcKie commented 2 months ago

Please provide the script you used.

SBhat2615 commented 2 months ago

> Please provide the script you used.

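import pymupdf4llm

# input_path: path to the source PDF; output_path: path of the Markdown file to write
md_text = pymupdf4llm.to_markdown(input_path, write_images=True)

# Write the Markdown as UTF-8; the context manager closes the file even on error
with open(output_path, "w", encoding="utf-8") as output:
    output.write(md_text)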
JorjMcKie commented 2 months ago

Don't let me guess please: On which page are you missing what?

SBhat2615 commented 2 months ago

> Don't let me guess please: On which page are you missing what?

  1. Figures 1 and 2 are not extracted as images.
  2. Tables 3, 5, and 6 are not extracted as images.

sample_document.md

SBhat2615 commented 2 months ago

For superscripts, if we can get output similar to this, that would be good as well.

[Image: Screenshot 2024-08-27 at 11 14 29 AM]
CedricLor commented 1 month ago

Regarding the superscript handling request, I guess what you're looking for is a feature that handles footnotes and footnote references.

This would obviously be useful but it would imply a major refactoring.

A naive approach would mean first detecting superscript text within the body text (this is already in place), saving it in some data structure for further processing, then detecting the footnotes on the page and distinguishing them from the body text, and finally matching the footnotes with the references.
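The first step of that approach, collecting superscript text, could be sketched roughly as follows. This assumes the page dictionary layout returned by PyMuPDF's `page.get_text("dict")`, where bit 0 of a span's `flags` marks superscripted text. The helper name `collect_superscripts` is hypothetical and not part of pymupdf4llm.

```python
SUPERSCRIPT_FLAG = 1  # bit 0 of span["flags"] in PyMuPDF's text dictionary


def collect_superscripts(page_dict):
    """Return (text, bbox) for every superscript span in a page dictionary.

    `page_dict` is assumed to have the shape produced by
    `page.get_text("dict")`: blocks -> lines -> spans.
    """
    found = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span.get("flags", 0) & SUPERSCRIPT_FLAG:
                    found.append((span["text"].strip(), tuple(span["bbox"])))
    return found
```

With a real document, `page_dict` would come from calling `page.get_text("dict")` on each page. The bounding boxes are kept because the position on the page is likely needed later to tell references from footnotes.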

Since footnotes are usually located at the bottom of the page, footnote references sit inside the body text, and pymupdf4llm generates the string linearly, the script would need to use the saved references to try to match the beginning of the lines at the bottom of the page. So far, not that difficult.
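A minimal sketch of that matching step, under the assumption that the saved references are plain strings and the page is available as a list of text lines in reading order. The function name `match_footnotes` and the accepted separators after the marker are illustrative choices, not anything pymupdf4llm provides.

```python
import re


def match_footnotes(markers, page_lines):
    """Map each saved reference marker to the footnote line it opens.

    The page is scanned from the bottom up, because footnotes usually sit
    below the body text. A line matches when it starts with the marker
    followed by whitespace, "." or ")".
    """
    matches = {}
    for marker in markers:
        pattern = re.compile(r"^\s*" + re.escape(marker) + r"[\s.)]+(.*)$")
        for line in reversed(page_lines):
            hit = pattern.match(line)
            if hit:
                matches[marker] = hit.group(1).strip()
                break
    return matches
```

Scanning bottom-up means that a body line which happens to start with the same number is only picked when no lower line matches. Markers without a matching line are simply left out of the result.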

Once a footnote has been matched, though, we would have to go back into the string to create the reference.

Moreover, footnote references are sometimes numbered per page, with the index reset on each page. In a single md string for a multi-page document, this would produce ambiguous footnotes and footnote references, so the script would also need to handle possible renumbering.
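The renumbering could look something like the sketch below. It assumes each page's body text marks references as `[marker]` (a placeholder format chosen for illustration) and that the matched footnotes are available per page as a dictionary. The output uses the `[^n]` footnote syntax that GitHub and several other Markdown processors support as an extension; `merge_pages` is a hypothetical helper.

```python
import re


def merge_pages(pages):
    """Join per-page text into one string with document-wide footnote numbers.

    `pages` is a list of (body, notes) pairs, where `body` marks references
    as "[marker]" (an assumed placeholder format) and `notes` maps each
    marker to its footnote text for that page only.
    """
    bodies, definitions = [], []
    counter = 0
    for body, notes in pages:
        # Assign a fresh global number to every footnote on this page.
        local_to_global = {}
        for marker in notes:
            counter += 1
            local_to_global[marker] = counter
            definitions.append(f"[^{counter}]: {notes[marker]}")

        def replace(hit):
            marker = hit.group(1)
            if marker in local_to_global:
                return f"[^{local_to_global[marker]}]"
            return hit.group(0)  # leave unmatched brackets untouched

        bodies.append(re.sub(r"\[([^\]\[]+)\]", replace, body))
    return "\n\n".join(bodies + definitions)
```

Because the mapping is rebuilt for every page, a marker `1` on page 2 gets a different global number than a marker `1` on page 1, which removes the ambiguity.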

Some documents also use several kinds of footnote markers at once (e.g. Arabic and Roman numerals, to distinguish the author's footnotes from the publisher's or the translator's), and these would also need to be differentiated and tracked in the data structure.

Finally, superscript text might also refer to endnotes or mark other information (e.g. "tm", the copyright symbol, the "o" in the number abbreviation "no", etc.).
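To illustrate the kind of bookkeeping the last two points imply, here is a rough classifier for superscript strings. The categories, the symbol set, and the name `classify_superscript` are assumptions made for this sketch. A real document would likely need more cases.

```python
import re

# Strict lower-case Roman numeral pattern (1 to 3999).
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)


def classify_superscript(text):
    """Sort a superscript string into a rough category."""
    token = text.strip()
    if token.isascii() and token.isdigit():
        return "arabic"
    if ROMAN.match(token.lower()):
        return "roman"
    if token in {"*", "†", "‡", "§"}:
        return "symbol"
    return "other"  # e.g. "TM", "©", the "o" in "No", ordinal suffixes
```

Single letters such as "c" or "m" are valid Roman numerals, so they are reported as "roman" even when a document uses them as alphabetic markers. Resolving that would need context from the rest of the page.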

All this processing would probably have some performance impact.

So while the feature would obviously be welcome, this makes it almost a package of its own. I personally think it would be better handled by a dedicated post-processing script that does only this and does it well, rather than directly in pymupdf4llm.

JorjMcKie commented 1 month ago

@CedricLor - thank you for your thoughtful assessment of footnotes. I totally agree with you: this is something we will probably never support, for all the reasons you mentioned. It is simply out of scope.