mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

[Bug]: ff ligature (and friends) can't be found and is improperly copied #18662

Status: Closed (ellis-hebbia closed 1 month ago)

ellis-hebbia commented 1 month ago

Attach (recommended) or Link to PDF file

Chrome Page 1 Terms of Service – Hugging Face.pdf

Web browser and its version

Chrome 127 or Firefox 129

Operating system and its version

Mac OS 14

PDF.js version

4.4

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

In the attached PDF (webpage downloaded from Chrome):

What is the expected behavior?

'ff' and friends should be parsed correctly and therefore be findable and copyable.

What went wrong?

Words containing the 'ff' ligature (and similar ligatures) can't be found using the search feature, and they are copied improperly.

The exact same webpage downloaded in Firefox works perfectly fine in the PDF.js viewer, so clearly Chrome's PDF export formats the PDF differently in this case. I'm guessing there's some issue with the font or Unicode mapping data? There's also a missing emoji, which may be a clue.

Link to a viewer

No response

Additional context

This is possibly related to https://github.com/mozilla/pdf.js/issues/12237, however examining the internals of the PDF I didn't find any /ActualText data that could have been used to correct the text. The Chrome PDF viewer and Adobe Acrobat can both find (Command + F) and copy the words properly.

Snuffleupagus commented 1 month ago

PDF.js version

4.4

That's not a complete version number, please remember to state the full version when reporting an issue. (Also, any 4.4.x version is no longer supported.)

This is possibly related to #12237, however examining the internals of the PDF I didn't find any /ActualText data that could have been used to correct the text.

Duplicate of #12237. In this case the /Contents streams are compressed, which is very common in PDFs, so you cannot, for example, just open the PDF document in a text editor and search for occurrences of a word.
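To illustrate why a plain-text search misses the data, here is a minimal Python sketch that inflates the streams before searching. It assumes the streams use plain /FlateDecode and that /ActualText is written as a simple literal string; it does not handle object streams, predictors, encryption, hex strings, or escaped parentheses. The function names `inflate_streams` and `find_actual_text` are hypothetical helpers, not part of PDF.js or any library.

```python
import re
import zlib

# Raw bytes between the `stream` and `endstream` keywords. The trailing
# end-of-line before `endstream` is captured too; zlib ignores it below.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Assumption: /ActualText is a literal string without nested parentheses.
ACTUAL_TEXT_RE = re.compile(rb"/ActualText\s*\((.*?)\)")


def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the decompressed body of every stream zlib can inflate."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            body = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate-compressed (another filter, or raw data): skip it.
            continue
        if body:
            inflated.append(body)
    return inflated


def find_actual_text(pdf_bytes: bytes) -> list[bytes]:
    """Collect /ActualText values found inside the inflated streams."""
    hits = []
    for body in inflate_streams(pdf_bytes):
        hits.extend(ACTUAL_TEXT_RE.findall(body))
    return hits
```

Calling `find_actual_text(open("file.pdf", "rb").read())` on a document like the one attached would be one way to check for /ActualText entries without a dedicated PDF tool, within the limits stated above.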

THausherr commented 1 month ago

/ActualText is there:

        /Span << /ActualText (ff) >> BDC
          9.6839905 0 Td
          (\000{) Tj
        EMC
...
        /Span << /ActualText (ft) >> BDC
          9.071991 0 Td
          (\000|) Tj
        EMC
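In the excerpt above, each ligature glyph is shown with a two-byte string (`\000` is an octal escape for a zero byte, followed by `{` or `|`) and wrapped in a marked-content span whose /ActualText carries the intended characters. As a rough sketch of how a text extractor could use that, the Python below pairs each span's raw `Tj` operand with its replacement text. It is not how PDF.js works internally; `replacement_texts` is a hypothetical helper, and the regular expressions assume simple literal strings with no nested or escaped parentheses and no nested spans.

```python
import re

# One marked-content span: /Span << /ActualText (text) >> BDC ... EMC
SPAN_RE = re.compile(
    rb"/Span\s*<<\s*/ActualText\s*\((?P<actual>.*?)\)\s*>>\s*BDC"
    rb"(?P<body>.*?)EMC",
    re.DOTALL,
)

# First literal string shown with the Tj operator inside a span body.
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def replacement_texts(content: bytes) -> list[tuple[bytes, str]]:
    """Pair each span's raw Tj operand with its /ActualText replacement."""
    pairs = []
    for span in SPAN_RE.finditer(content):
        shown = TJ_RE.search(span.group("body"))
        raw_operand = shown.group(1) if shown else b""
        # Assumption: the /ActualText value is plain single-byte text.
        pairs.append((raw_operand, span.group("actual").decode("latin-1")))
    return pairs
```

An extractor following this idea would emit the /ActualText value ("ff", "ft") in place of whatever the glyph code maps to, which is what makes the words searchable and copyable.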