Open SBhat2615 opened 2 months ago
Please provide the script you used.
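import pymupdf4llm

# input_path is the source PDF, output_path the Markdown file to write
md_text = pymupdf4llm.to_markdown(input_path, write_images=True)
# Write as UTF-8 and let the context manager close the file
with open(output_path, "w", encoding="utf-8") as output:
    output.write(md_text)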
Don't let me guess please: On which page are you missing what?
For superscripts, if we can get output similar to this, that would be good as well.
Regarding the request to improve superscript handling, I guess what you're looking for is a feature that handles footnotes and footnote references.
This would obviously be useful but it would imply a major refactoring.
A naive approach would first detect superscript text within the body text (this is already implemented), save it in some data structure for further processing, then detect the footnotes on the page and separate them from the body text, and finally match the footnotes with the references.
Footnotes are usually located at the bottom of the page, while the footnote references sit inside the body text, and pymupdf4llm generates the string linearly. The script would therefore need to use the saved references to try to match the beginning of the lines at the bottom of the page. So far, not that difficult.
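To make the matching step concrete, here is a minimal sketch of it as a post-processing pass over the text of a single page. It rests on assumptions not confirmed anywhere in this thread: that superscript references show up in the text as `[1]`, `[2]`, and so on, attached to the preceding word, and that footnote lines at the bottom of the page start with the same number. The name `match_footnotes` and both regular expressions are hypothetical, not part of pymupdf4llm.

```python
import re

# Assumption: a reference is a bracketed number attached to the preceding word,
# e.g. "word[1]". A footnote line that starts with "[1]" is not counted here.
REF = re.compile(r"(?<=\S)\[(\d+)\]")
# Assumption: a footnote line starts with its number, e.g. "[1] text" or "1. text".
NOTE = re.compile(r"^\[?(\d+)\]?[.)]?\s+(\S.*)$")


def match_footnotes(page_text):
    """Split one page into (body, notes); notes maps label -> footnote text."""
    referenced = set(REF.findall(page_text))
    lines = page_text.splitlines()
    notes = {}
    end = len(lines)
    # Walk upward from the bottom of the page while lines look like footnotes
    # whose label is actually referenced somewhere in the text.
    while end > 0:
        line = lines[end - 1].strip()
        if line:
            m = NOTE.match(line)
            if m is None or m.group(1) not in referenced:
                break
            notes[m.group(1)] = m.group(2)
        end -= 1
    body = "\n".join(lines[:end]).rstrip()
    return body, notes
```

A line is only treated as a footnote when its label is referenced in the text, so a last body line that merely starts with a number stays in the body.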
However, once a footnote has been matched, we would have to go back into the string to create the reference.
In addition, footnote references are sometimes numbered per page, with the index reset on each page. In a single md string for a multi-page document, this would produce ambiguous footnotes and footnote references, so the script would also need to handle possible renumbering.
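The renumbering problem can be illustrated with a short sketch. It assumes each page has already been split into a body string and a dict of footnotes keyed by their per-page label, and that references appear as `[1]`, `[2]` in the body; both are assumptions for illustration. The function name `renumber` is hypothetical, and the `[^n]` output uses the footnote syntax of some Markdown dialects (it is not part of core CommonMark).

```python
import re

# Assumption: references appear in the body text as "[1]", "[2]", ...
REF = re.compile(r"\[(\d+)\]")


def renumber(pages):
    """pages: list of (body, notes) with per-page labels.

    Returns (text, notes) where every footnote has a document-wide number.
    """
    counter = 0
    bodies = []
    all_notes = {}
    for body, notes in pages:
        mapping = {}  # per-page label -> document-wide number

        def repl(m):
            nonlocal counter
            label = m.group(1)
            if label not in notes:
                return m.group(0)  # no matching footnote: leave text unchanged
            if label not in mapping:
                counter += 1
                mapping[label] = counter
                all_notes[counter] = notes[label]
            return f"[^{mapping[label]}]"

        bodies.append(REF.sub(repl, body))
    return "\n\n".join(bodies), all_notes
```

For example, two pages that each use the label `1` end up with distinct numbers: `renumber([("Alpha[1].", {"1": "first"}), ("Gamma[1].", {"1": "second"})])` returns the text `"Alpha[^1].\n\nGamma[^2]."` and the notes `{1: "first", 2: "second"}`.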
Some documents also use several kinds of symbols for footnote references at the same time (e.g. Arabic and Roman numerals, to differentiate the author's footnotes from the publisher's or the translator's), and these would also need to be differentiated and tracked in the data structure.
Finally, superscript text might also refer to endnotes or mark other information (e.g. "tm", the copyright symbol, the "o" in the number sign "no", and so on).
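A rough sketch of how such superscripts might be sorted into separate series before any matching is attempted. The categories, the list of non-reference marks, and the name `classify_superscript` are all illustrative assumptions; a real document would likely need its own rules.

```python
import re

# Assumption: superscripts that usually are not footnote references.
NON_REFERENCE = {"tm", "™", "©", "®", "o", "º", "st", "nd", "rd", "th"}
# Assumption: typographic marks commonly used as footnote symbols.
SYMBOLS = {"*", "†", "‡", "§"}


def classify_superscript(text):
    """Return 'other', 'arabic', 'roman', 'symbol' or 'unknown'."""
    s = text.strip()
    if s.lower() in NON_REFERENCE:
        return "other"  # trademark, ordinal suffix, etc.
    if re.fullmatch(r"[0-9]+", s):
        return "arabic"
    if re.fullmatch(r"[ivxlcdm]+", s, re.IGNORECASE):
        return "roman"
    if s in SYMBOLS:
        return "symbol"
    return "unknown"
```

Checking the non-reference marks first keeps "tm" or "th" out of the footnote series; each remaining series could then be tracked in its own mapping.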
All this processing would probably have some performance impact.
So while the feature would obviously be welcome, this makes it almost a package of its own. I personally think it would be better handled in a dedicated post-processing script that does only this and does it well, rather than directly in pymupdf4llm.
@CedricLor - thank you for your thoughtful assessment on footnotes. I totally agree with you: this is something we will probably never support, for all the reasons you mentioned. It is simply out of scope.
1. Embedded images are extracted to a dedicated folder, which I observed for some of the documents.
There are some graphical images in the PDF below which are not getting extracted to a separate folder.
2. There are also superscripts in the PDF, which are not referenced.
sample_document.pdf