Closed by dentro-innovation 1 month ago
Same behavior with Python 3.10.12: there is no text in the output file with 0.0.17, but I get the expected output with 0.0.16.
However, when setting `write_images=True` in 0.0.17, the images are referenced, and the image extraction is far better than with 0.0.16, which extracts whole pages as images.
Output with 0.0.17 and `write_images=True`:
![](input.pdf-0-0.png)
-----
![](input.pdf-1-0.png)
-----
![](input.pdf-2-0.png)
-----
![](input.pdf-3-0.png)
![](input.pdf-3-1.png)
-----
![](input.pdf-4-0.png)
![](input.pdf-4-1.png)
-----
![](input.pdf-5-0.png)
![](input.pdf-5-1.png)
-----
![](input.pdf-6-0.png)
![](input.pdf-6-1.png)
![](input.pdf-6-2.png)
-----
-----
-----
-----
-----
-----
-----
-----
-----
-----
-----
-----
![](input.pdf-18-0.png)
-----
![](input.pdf-19-0.png)
![](input.pdf-19-1.png)
-----
-----
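To make the pattern in that output easier to see, here is a minimal sketch that splits the Markdown on the `-----` lines and reports, per page, how many image references it contains and whether any text is left over. The function name `summarize_pages` is made up for illustration, and treating every `-----` line as a page separator is an assumption based on the output above; a real horizontal rule inside the text would be miscounted.

```python
import re

# Matches Markdown image references such as ![](input.pdf-0-0.png)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def summarize_pages(markdown_text, separator="-----"):
    """Return one (image_count, has_text) tuple per page of the Markdown."""
    pages, current = [], []
    for line in markdown_text.splitlines():
        if line.strip() == separator:
            # A separator line closes the current page, even if it is empty.
            pages.append(current)
            current = []
        else:
            current.append(line)
    # Keep trailing content that is not followed by a separator.
    if any(line.strip() for line in current):
        pages.append(current)

    summary = []
    for lines in pages:
        body = "\n".join(lines)
        image_count = len(IMAGE_REF.findall(body))
        # Whatever remains after removing image references counts as text.
        has_text = bool(IMAGE_REF.sub("", body).strip())
        summary.append((image_count, has_text))
    return summary
```

Run on the output above, every page should come back with `has_text` set to `False`, which is the missing-text symptom described in this issue.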
Where is the reproducing file please?
My bad, I forgot to attach it.
Btw, the behavior is the same on localhost and on an Ubuntu server that I just tried.
PDF in question:
I also have more PDFs from the same manual provider that don't convert properly, in case you want to test with more files.
We are now deliberately ignoring text with a smaller font size than 3. Do you need such stuff?
Oh I see.
Well, yes, I'd need it for this use case. I got this PDF by "printing" it from this website: https://manuales.x-28.com/m/N4-MPXH/1/instalador.html Maybe such small fonts occur often when websites are printed this way?
Perhaps the user could choose the minimum font size below which pymupdf4llm skips text?
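One way to check whether a given PDF is affected is to look at the font sizes PyMuPDF reports for its text spans. The sketch below is only an illustration: the helper names `span_sizes`, `count_below`, and `small_text_report` are invented here, the default threshold of 3 is taken from the comment above, and it is an assumption that pymupdf4llm compares the same per-span `size` value internally.

```python
def span_sizes(page_dict):
    """Yield the font size of every non-blank text span in one page dict.

    page_dict is expected to look like the result of page.get_text("dict").
    """
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span.get("text", "").strip():
                    yield span["size"]


def count_below(sizes, min_size=3.0):
    """Count how many font sizes are strictly smaller than min_size."""
    return sum(1 for size in sizes if size < min_size)


def small_text_report(pdf_path, min_size=3.0):
    """Return (spans below min_size, total spans) for a whole PDF."""
    import pymupdf  # imported lazily; the helpers above do not need it

    sizes = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            sizes.extend(span_sizes(page.get_text("dict")))
    return count_below(sizes, min_size), len(sizes)
```

Calling `small_text_report("input.pdf")` returns a pair of counts; if nearly all spans fall below the threshold, that would be consistent with the empty output seen on 0.0.17.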
Before we jump to conclusions: here is a script that does print text. Maybe the margins value plays the major role:
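    import pathlib

    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open("input.pdf")
    # margins=0 means no page borders are excluded from extraction
    md = pymupdf4llm.to_markdown(
        doc,
        margins=0,
    )
    # write the Markdown next to the input file, e.g. input.pdf.md
    pathlib.Path(doc.name + ".md").write_bytes(md.encode())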
Before I give confirmations: I have no idea how PyMuPDF works under the hood apart from running the basic commands.
This script seems to work marvelously! It does add a bit too much whitespace in front of sentences, but it works on 0.0.17!
This is my output with 0.0.17:
This is my output with 0.0.16 (just first page):
I'm using Python 3.12.2 on Ubuntu.
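If the extra whitespace in front of sentences is a problem, it can be removed after conversion. The following is a minimal sketch, not part of pymupdf4llm: the function name `strip_leading_whitespace` is hypothetical, and it assumes that indentation outside fenced code blocks carries no meaning, so it would also flatten nested lists and indented code blocks.

```python
FENCE = "`" * 3  # Markdown code fence marker


def strip_leading_whitespace(markdown_text):
    """Remove leading spaces and tabs from lines outside fenced code blocks."""
    cleaned = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            # Toggle fence state and keep the fence line untouched.
            in_fence = not in_fence
            cleaned.append(line)
        elif in_fence:
            # Indentation inside code blocks is significant, so keep it.
            cleaned.append(line)
        else:
            cleaned.append(line.lstrip(" \t"))
    return "\n".join(cleaned)
```

It could be applied to the string returned by `to_markdown` before the result is written to disk.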