mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

The copied text does not match the displayed one #12521

Open Andrei-Zinouyeu opened 3 years ago

Andrei-Zinouyeu commented 3 years ago

Attach (recommended) or Link to PDF file here:

Configuration:

Steps to reproduce the problem:

  1. Open IncorrectCopyOperation.pdf

  2. Select and copy text (all text from any column)

  3. Paste into any editor.

What is the expected behavior? (add screenshot) The displayed and copied texts must be the same.

What went wrong? (add screenshot) The text does not match the displayed one.

Note: There is no problem in Adobe Acrobat reader.

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

marcosnery commented 3 years ago

I had a problem copying text from this PDF into an editor. The problem seems to occur with text that mixes bold, capitalized, and italic runs. We also got the same result when reading the document with the NVDA screen reader.

The pdf text:

Certifico que digitalizei e anexei aos presentes autos, nesta data, o COMPROVANTE DE ENTREGA DE OBJETO AO DESTINATÁRIO, referente à notificação de ID C1234567, encaminhada ao Impetrante IMPORTADORA DE PRODUTOS, cumprida em 22/04/2021, com diligência positiva.

The result after the copy/paste in editor:

Certifico que e anexei aos presentes autos, nesta data, digitalizei o , referente à COMPROVANTE DE ENTREGA DE OBJETO AO DESTINATÁRIOnotificação de ID C123456, encaminhada ao Impetrante IMPORTADORA DE , cumprida em 22/04/2021, com diligência positiva.PRODUTOS

Updating with more info after new tests...

It seems that the extractor returns the text in the order in which the text-showing operators appear in the PDF content stream, regardless of where they place the text on the page. As a result, the extracted text depends on how the PDF document was generated.

This excerpt of the stream shows it:

BT
/F1 12 Tf
1 0 0 1 73.6875 688.01251 Tm
(Certifico que ) Tj
ET
BT
/F1 12 Tf
1 0 0 1 260.88751 688.01251 Tm
(e anexei aos presentes autos, nesta data, ) Tj
ET
BT
/F2 12 Tf
1 0 0.21256 1 174.4875 688.01251 Tm
(digitalizei ) Tj
ET
BT
/F1 12 Tf
1 0 0 1 73.6875 669.75 Tm
(o ) Tj
ET
BT
/F1 12 Tf
1 0 0 1 433.6875 669.75 Tm
<2C207265666572656E746520E020> Tj   % <- , referente à 
ET
BT
/F3 12 Tf
1 0 0 1 88.0875 669.75 Tm
<434F4D50524F56414E544520444520454E5452454741204445204F424A45544F20414F2044455354494E4154C152494F> Tj   % <- COMPROVANTE DE ENTREGA DE OBJETO AO DESTINATÁRIO
ET
BT
/F1 12 Tf
1 0 0 1 73.6875 651.52502 Tm
<6E6F746966696361E7E36F20646520494420433132333435362C20656E63616D696E6861646120616F20496D70657472616E746520> Tj  % <- notificação de ID C123456, encaminhada ao Impetrante 
ET
BT
/F3 12 Tf
1 0 0 1 455.28751 651.52502 Tm
(IMPORTADORA DE ) Tj
ET

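For illustration only, the runs in the excerpt above can be put back in visual order by sorting on the translation components of each Tm matrix. This is a minimal sketch, not how pdf.js extracts text: Run, reading_order, and line_tolerance are hypothetical names, and it assumes horizontal text in a single column on an unrotated page.

```python
from typing import NamedTuple


class Run(NamedTuple):
    text: str
    x: float  # horizontal translation taken from the Tm operator
    y: float  # baseline translation taken from the Tm operator


# Runs in stream order, copied from the excerpt above.
RUNS = [
    Run("Certifico que ", 73.6875, 688.01251),
    Run("e anexei aos presentes autos, nesta data, ", 260.88751, 688.01251),
    Run("digitalizei ", 174.4875, 688.01251),
    Run("o ", 73.6875, 669.75),
    Run(", referente à ", 433.6875, 669.75),
    Run("COMPROVANTE DE ENTREGA DE OBJETO AO DESTINATÁRIO", 88.0875, 669.75),
    Run("notificação de ID C123456, encaminhada ao Impetrante ", 73.6875, 651.52502),
    Run("IMPORTADORA DE ", 455.28751, 651.52502),
]


def reading_order(runs, line_tolerance=2.0):
    """Group runs into lines by baseline, then sort each line left to right."""
    lines = []  # list of (baseline_y, [runs on that baseline])
    # PDF user space has y growing upward, so the top line has the largest y.
    for run in sorted(runs, key=lambda r: -r.y):
        if lines and abs(lines[-1][0] - run.y) <= line_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.y, [run]))
    return [
        "".join(r.text for r in sorted(line_runs, key=lambda r: r.x))
        for _, line_runs in lines
    ]


for line in reading_order(RUNS):
    print(line)
```

With the values from the excerpt, the first line comes out as "Certifico que digitalizei e anexei aos presentes autos, nesta data, ", which matches what is displayed. A plain position sort like this would interleave columns on a multi-column page, so it is only a starting point.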
Andrei-Zinouyeu commented 2 years ago

Are there any updates on this? How should we proceed in such a case?

calixteman commented 2 years ago

The PDF contains the same text twice:

So it isn't really a bug... but a missing feature: Acrobat and Chrome handle this correctly. We would need to track each char and its position in order to know that a new char in the stream is at almost the same position as another one.
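A rough sketch of that idea, assuming each glyph is already available as a (char, x, y) tuple in stream order: a grid keyed by character and rounded position lets each new glyph be checked against nearby earlier ones without scanning everything seen so far. The function name, the tuple layout, and the tolerance value are assumptions for illustration, not pdf.js code.

```python
from math import floor


def drop_overlapping_duplicates(glyphs, tolerance=0.5):
    """Keep a glyph only if the same char was not already seen near its position.

    glyphs: iterable of (char, x, y) tuples in stream order.
    """
    seen = {}  # (char, cell_x, cell_y) -> list of (x, y) positions already kept
    kept = []
    for char, x, y in glyphs:
        # Cell size equals the tolerance, so any earlier glyph within the
        # tolerance must sit in this cell or one of its 8 neighbours.
        cx, cy = floor(x / tolerance), floor(y / tolerance)
        duplicate = any(
            abs(x - px) <= tolerance and abs(y - py) <= tolerance
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for px, py in seen.get((char, cx + dx, cy + dy), ())
        )
        if not duplicate:
            kept.append((char, x, y))
            seen.setdefault((char, cx, cy), []).append((x, y))
    return kept


# "AB" drawn twice at almost the same place, followed by a genuine second "A".
glyphs = [
    ("A", 10.0, 100.0),
    ("B", 17.0, 100.0),
    ("A", 10.2, 100.1),
    ("B", 17.1, 99.9),
    ("A", 24.0, 100.0),
]
print("".join(char for char, _, _ in drop_overlapping_duplicates(glyphs)))
```

The grid keeps the per-glyph cost roughly constant, which matters given the performance concern, but it does add bookkeeping for every character on the page.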

Andrei-Zinouyeu commented 2 years ago

Do you plan to add this feature in the next releases?

calixteman commented 2 years ago

Not really... my feeling (which could be wrong) is that it's low priority because it's a corner case, and as far as I can tell it needs to be well thought out in order to avoid any perf penalties and any additional code complexity. So maybe in a few months... just don't ask me to define "few". But as usual, if you have a PR, we'll be happy to review it.

Andrei-Zinouyeu commented 2 years ago

Is this case going to be fixed this year?

aureliomjr commented 4 months ago

Greetings, @calixteman. Just for additional information: we are experiencing problems related to this issue that affect blind or visually impaired users who need screen-reading software. Thank you.