Closed ellis-hebbia closed 1 month ago
PDF.js version
4.4
That's not a complete version number, please remember to state the full version when reporting an issue.
(Also, any 4.4.x version is no longer supported.)
This is possibly related to #12237, however examining the internals of the PDF I didn't find any /ActualText data that could have been used to correct the text.
Duplicate of #12237. In this case the /Contents streams are compressed, which is very common in PDFs, so you cannot simply open the PDF document in a text editor and search for occurrences of a word.
/ActualText is there:
/Span << /ActualText (ff) >> BDC
9.6839905 0 Td
(\000{) Tj
EMC
...
/Span << /ActualText (ft) >> BDC
9.071991 0 Td
(\000|) Tj
EMC
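A minimal sketch of how one might check a PDF for /ActualText when the /Contents streams are compressed, as described above. It assumes the streams use only /FlateDecode and that /ActualText is written as a literal `(...)` string. It ignores /Length, other filters, hex strings, object streams, and encryption. The names `inflate_streams` and `find_actual_text` are made up for this example and are not part of PDF.js.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" (naive: ignores /Length).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# A literal-string /ActualText entry, allowing backslash escapes inside.
ACTUAL_TEXT_RE = re.compile(rb"/ActualText\s*\(((?:\\.|[^\\)])*)\)")


def inflate_streams(pdf_bytes):
    """Yield the decompressed body of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            body = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (images, other filters, ...)
        yield body


def find_actual_text(pdf_bytes):
    """Return the raw /ActualText strings found in inflated streams."""
    found = []
    for body in inflate_streams(pdf_bytes):
        found.extend(m.group(1) for m in ACTUAL_TEXT_RE.finditer(body))
    return found
```

Hypothetical usage: `find_actual_text(open("file.pdf", "rb").read())`. A dedicated PDF inspection tool is more reliable than this regex scan, but the sketch shows why a plain text search of the file finds nothing.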
Attach (recommended) or Link to PDF file
Chrome Page 1 Terms of Service – Hugging Face.pdf
Web browser and its version
Chrome 127 or Firefox 129
Operating system and its version
macOS 14
PDF.js version
4.4
Is the bug present in the latest PDF.js version?
Yes
Is a browser extension
No
Steps to reproduce the problem
In the attached PDF (webpage downloaded from Chrome):
What is the expected behavior?
'ff' and friends should be parsed correctly and therefore be findable and copy-able.
What went wrong?
Can't find the text using the search feature.
The exact same webpage downloaded in Firefox works perfectly fine in the PDF.js viewer, so Chrome's PDF export is clearly structuring the PDF differently in this case. I'm guessing there's some issue with the font or Unicode mapping data? There's also a missing emoji, which may be a clue.
Link to a viewer
No response
Additional context
This is possibly related to https://github.com/mozilla/pdf.js/issues/12237, however examining the internals of the PDF I didn't find any /ActualText data that could have been used to correct the text. The Chrome PDF viewer and Adobe Acrobat are both able to find (Command + F) and copy the words properly.
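To illustrate the expected behavior, here is a rough sketch of how a text extractor could honor `/Span << /ActualText (...) >> BDC ... EMC` spans like the ones quoted in this thread: it emits the /ActualText string once and skips the glyphs shown inside the span. This is not how PDF.js implements text extraction. It assumes an already decompressed content stream, literal-string /ActualText, no nested marked content, and only the `Tj` operator. `extract_text` and `decode_glyphs` are hypothetical names, and `decode_glyphs` stands in for the font's glyph-to-Unicode mapping.

```python
import re
from typing import Callable

# One pattern for the three things this sketch cares about:
# a /Span opening with /ActualText, a closing EMC, and a "(...) Tj" show.
TOKEN_RE = re.compile(
    rb"/Span\s*<<\s*/ActualText\s*\((?P<actual>(?:\\.|[^\\)])*)\)\s*>>\s*BDC"
    rb"|(?P<emc>\bEMC\b)"
    rb"|\((?P<shown>(?:\\.|[^\\)])*)\)\s*Tj"
)


def extract_text(content: bytes, decode_glyphs: Callable[[bytes], str]) -> str:
    """Concatenate shown text, preferring /ActualText inside marked spans."""
    out = []
    in_span = False
    for m in TOKEN_RE.finditer(content):
        if m.group("actual") is not None:
            # Use the replacement text for the whole span.
            out.append(m.group("actual").decode("latin-1"))
            in_span = True
        elif m.group("emc"):
            in_span = False
        elif not in_span:
            # Ordinary text: rely on the (assumed) glyph mapping.
            out.append(decode_glyphs(m.group("shown")))
    return "".join(out)
```

With a stream that shows `e`, then a span whose /ActualText is `ff`, then `ect`, the result is the searchable word `effect` rather than the raw ligature glyph code.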