mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

PDF.js does not take into account /ActualText #12237

Open TimotheAlbouy opened 4 years ago

TimotheAlbouy commented 4 years ago

ESET_Okrum_and_Ketrican.pdf

I'm using PDF.js 2.4.456. The issue I'm describing here is independent of the OS and the web browser.

To reproduce the problem, open the attached document with PDF.js (a quick way to reproduce the bug is to open the document in Firefox), then copy and paste some of the text. In the extracted text, the period characters are missing for the main font:

The Ke3chang group, also known as APT15, is a threat group believed to be operating out of China Its attacks were first reported in 2012,

This is due to a bad mapping in the /ToUnicode property of the font: the 0x2E code (i.e. the period character in ASCII) in the text is mapped to U+0020 (a space in UTF-16).

But when we copy and paste the content of the PDF using Adobe Acrobat, all the period characters are extracted correctly. This is because Acrobat takes into account the /ActualText marked-content property inside the PDF.

This is what we see when we open the document using a PDF inspector like PDFDebugger of PDFBox:

(, also known as APT15, is a threat group believed to be operating out of China)Tj
/Span<</ActualText<FEFF002E>>> BDC 
(.)Tj
EMC

We see that the 0x2E code in (.)Tj, which according to the /ToUnicode map represents a space character, is marked as actually representing U+FEFF (a BOM) followed by U+002E, a period character.

Thus, Acrobat Reader extracts the periods correctly in the given report because it takes into account the /ActualText content, whereas PDF.js doesn't extract the periods because it only considers the /ToUnicode map.
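
To make the expected behavior concrete, here is a minimal sketch of an extractor that prefers /ActualText over the /ToUnicode result inside a marked-content sequence. It is not PDF.js code: the operator objects (`type`, `codes`, `actualTextHex`), the `toUnicode` map, and both function names are hypothetical, and the decoder only assumes UTF-16BE with a BOM, as in the `<FEFF002E>` string above.

```javascript
// Decode a PDF hex string such as "FEFF002E" into a JavaScript string.
// Assumption: UTF-16BE with a leading BOM (FE FF); other encodings are rejected.
function decodeActualTextHex(hex) {
  const clean = hex.replace(/\s+/g, "");
  const bytes = [];
  for (let i = 0; i < clean.length; i += 2) {
    bytes.push(parseInt(clean.slice(i, i + 2), 16));
  }
  if (bytes[0] !== 0xfe || bytes[1] !== 0xff) {
    throw new Error("this sketch only handles UTF-16BE with a BOM");
  }
  let out = "";
  for (let i = 2; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
  }
  return out;
}

// Walk a simplified, hypothetical operator list and build the extracted text.
// Inside a BDC ... EMC sequence that carries /ActualText, the replacement
// text is emitted once and the glyph codes shown by Tj are skipped.
function extractText(ops, toUnicode) {
  let text = "";
  const stack = []; // one boolean per open marked-content sequence
  for (const op of ops) {
    const suppressed = stack.length > 0 && stack[stack.length - 1];
    if (op.type === "BDC") {
      const hasActualText = typeof op.actualTextHex === "string";
      if (hasActualText && !suppressed) {
        text += decodeActualTextHex(op.actualTextHex);
      }
      stack.push(suppressed || hasActualText);
    } else if (op.type === "EMC") {
      stack.pop();
    } else if (op.type === "Tj" && !suppressed) {
      for (const code of op.codes) {
        // Fall back to the raw code when the map has no entry (assumption).
        text += toUnicode.get(code) ?? String.fromCharCode(code);
      }
    }
  }
  return text;
}

// The faulty mapping described above: code 0x2E maps to a space.
const toUnicode = new Map([[0x2e, " "]]);
const ops = [
  { type: "Tj", codes: [..."China"].map((c) => c.charCodeAt(0)) },
  { type: "BDC", tag: "Span", actualTextHex: "FEFF002E" },
  { type: "Tj", codes: [0x2e] },
  { type: "EMC" },
];
console.log(JSON.stringify(extractText(ops, toUnicode))); // "China."
```

With the BDC/EMC pair removed from `ops`, the same function returns "China " with a trailing space, which matches what we currently get when copying from PDF.js.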

Is there an ongoing effort to fix this problem?

Snuffleupagus commented 4 years ago

Duplicate of #12100 (please don't knowingly open duplicate issues).

TimotheAlbouy commented 4 years ago

It wasn't clear before that the issue came from the ActualText property; we thought that PDF.js did the job right because it followed the ToUnicode map. I would rather close/delete the old issue and keep this one, which is clearer, if you agree.

timvandermeij commented 4 years ago

I closed the other issue since this one contains a bit more detail, but we should indeed keep the discussion limited to this issue from now on.

Mowmowj commented 8 months ago

Hi @TimotheAlbouy, do you think it would be possible to do some preliminary work to fix the Unicode mapping problem in the PDF file itself? Then PDF.js might render the text layer normally.

TimotheAlbouy commented 8 months ago

@Mowmowj of course this would be possible, but it would not address the root problem: currently PDF.js cannot extract some text that Acrobat Reader can. You wouldn't tell a visually impaired person to change their book instead of replacing their faulty glasses.

joelostblom commented 3 months ago

I ran into this issue when I noticed that PDF.js can't find text strings such as fi, tt, etc. if these are represented as ligatures in the PDF documents. I noticed that the PDF viewer Evince is able to find the search text even in PDFs that do contain ligatures. Since Evince is open source, maybe there is something in their approach that could be useful for PDF.js as well? I'm not familiar with exactly what that would be, but a brief search in their repo shows that they do, for example, some normalization of the text in the PDF: https://gitlab.gnome.org/GNOME/evince/-/commit/9de1152cd935d9f00f2709052d25d42b18cb1b0f
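
As a rough illustration of what such normalization could look like in JavaScript, here is a minimal sketch that folds compatibility characters before comparing. It is not taken from Evince or PDF.js; `normalizeForSearch` and `pageContains` are hypothetical names, and it only relies on the standard `String.prototype.normalize` method.

```javascript
// Fold compatibility characters into their plain-letter decompositions,
// e.g. U+FB01 (the "fi" ligature) becomes the two letters "f" and "i".
function normalizeForSearch(s) {
  return s.normalize("NFKD").toLowerCase();
}

// Compare page text and query after applying the same normalization to both.
function pageContains(pageText, query) {
  return normalizeForSearch(pageText).includes(normalizeForSearch(query));
}

// Text as it might come out of a PDF that uses the U+FB01 ligature.
const pageText = "The \uFB01rst \uFB01le";
console.log(pageContains(pageText, "first")); // true
console.log(pageText.includes("first")); // false
```

This only helps for ligatures that have a Unicode compatibility decomposition (ff, fi, fl, ffi, ffl, st). A ligature such as tt has no code point of its own, so it would still depend on the font's /ToUnicode map or on /ActualText. Normalizing also changes string lengths, so a real implementation would need to map match offsets back to the original text for highlighting.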

Mowmowj commented 3 months ago

Good point, it looks like it uses GLib. BTW, for inspiration, the MuPDF viewer also supports these special strings, but I'm not sure how they handle it.