Open brandenkmurray opened 3 months ago
As announced in my e-mail, here is a script that can be used as a workaround while the team is working on a final solution: repair-words.zip
> here is script that can be used as a circumvention while the team is working on a final solution
This will definitely be known to the team, but noting it here for completeness and in case someone searches: this issue is more fundamental than `words`, as even the `rawdict` format has it. Thanks!
Description of the bug
In some cases PyMuPDF adds newline characters in the middle of words. These newlines do not exist if you simply copy/paste the text from the PDF or extract the text using other libraries.
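For anyone who wants to quantify the problem on their own files, here is a minimal sketch in plain Python. It checks whether two extractions contain the same characters once whitespace is ignored, then counts the newlines that land inside a word. The helper names (`strip_ws`, `count_mid_word_breaks`) are made up for illustration and are not part of PyMuPDF. `EXTRACTED` is the footnote output quoted in this issue; `REFERENCE` is an assumed clean reading of the same footnote, not verified output from another tool.

```python
def strip_ws(text: str) -> str:
    """Return text with every whitespace character removed."""
    return "".join(text.split())


def _gaps(text: str) -> dict:
    """Map 'number of non-whitespace chars seen so far' -> whitespace run found there."""
    gaps = {}
    seen = 0
    for ch in text:
        if ch.isspace():
            gaps[seen] = gaps.get(seen, "") + ch
        else:
            seen += 1
    return gaps


def count_mid_word_breaks(reference: str, extracted: str) -> int:
    """Count whitespace runs in `extracted` that contain a newline and sit
    at a position where `reference` has no whitespace at all."""
    if strip_ws(reference) != strip_ws(extracted):
        raise ValueError("texts differ in more than whitespace")
    ref_gaps = _gaps(reference)
    ext_gaps = _gaps(extracted)
    return sum(
        1 for pos, run in ext_gaps.items()
        if "\n" in run and pos not in ref_gaps
    )


# Assumed clean reading of the footnote (hypothetical reference text).
REFERENCE = (
    "(1) Amounts represent the recorded investment in loans "
    "after recognizing the effects of the TDR, if any."
)

# Output reported in this issue, with real newline characters.
EXTRACTED = (
    "(1) \nAmounts r\n epresent \n the r\n ecorded \n investment \n in loa\n \nns a\n"
    " fter \n recognizing \n the effect\n \ns of t\n \n he TD\n \nR, \n if a\n ny."
)

print(count_mid_word_breaks(REFERENCE, EXTRACTED))
```

A non-zero count means at least one word was split by a newline that the reference text does not have. Newlines that fall between words are ignored on purpose, since those may be legitimate line breaks.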
How to reproduce the bug
wellsfargo-2022-annual-report.pdf
The text from the footnotes in this example looks okay using `pdfplumber` and `pdftotext`, but `pymupdf` outputs text that looks like
`(1) \nAmounts r\n epresent \n the r\n ecorded \n investment \n in loa\n \nns a\n fter \n recognizing \n the effect\n \ns of t\n \n he TD\n \nR, \n if a\n ny.`
with `\n` scattered throughout.
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.10