mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

CJK characters in clipboard after copy of latin text #18099

Closed: myfonj closed this issue 2 months ago

myfonj commented 4 months ago

Link to PDF file:

https://web.archive.org/web/20240515102919/https://www.oahovorcovicka.cz/files/soubory/WEB_2023/Vsledky_CR_2024.pdf

Configuration:

Steps to reproduce the problem:

  1. Open PDF from Web Archive (Warning: nearly 10MB payload.)
  2. Select first line
  3. Copy
  4. Paste

What is the expected behaviour? (add screenshot)

Clipboard should read Yýsledková listina přijímacích zkoušek, as it does when the same steps are performed in SumatraPDF or Acrobat Reader:

SumatraPDF with a selection rectangle around the first line, and Notepad++ above with a single line of text in the Latin alphabet

(This is an almost correct OCR of the scan, consisting of Latin characters.)

What went wrong? (add screenshot)

Clipboard reads 夀ý猀氀攀搀欀漀瘀á 氀椀猀琀椀渀愀 瀀ř椀樀í洀愀挀í挀栀稀欀漀甀š攀欀:

Firefox window showing the web-archived PDF with the first line selected, and a Notepad++ window with the pasted text consisting of CJK characters.

(This is a weird sequence of CJK characters, with a few Latin glyphs, all with diacritics.)


I see this PDF is really sloppy and there are many OCR errors throughout the document, but I guess that is not relevant.

ArmaandeepSingh commented 3 months ago

Hi @myfonj, could you please confirm whether you are facing this issue with all of the text in the PDF you provided here?

alexcat3 commented 2 months ago

This bug is reproducible in both Mozilla Firefox and Microsoft Edge on Windows 11 with the latest code. All text in the document, including numbers, is copied as Chinese characters. Meanwhile, Microsoft Edge's built-in PDF viewer copies the text correctly.

alexcat3 commented 2 months ago

Actually, my previous statement is incorrect: letters with accent marks are copied correctly in pdf.js; other characters are replaced with CJK.

alexcat3 commented 2 months ago

Looking at the garbled text in a hex editor, it appears that the problem is that ASCII characters were converted to UTF-16 with the wrong endianness.
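
For example, the first garbled character, 夀, is U+5900: the ASCII byte for "Y" (0x59) sitting in the high byte of a 16-bit code unit instead of the low byte. Likewise 猀 is U+7300, i.e. "s" (0x73). A quick illustration (my own sketch, not pdf.js code):

    // Each ASCII byte ends up as the HIGH byte of a UTF-16 code unit,
    // with 0x00 as the low byte, instead of the other way around.
    const garble = ch => String.fromCharCode(ch.charCodeAt(0) << 8);
    garble("Y"); // "夀" (U+5900)
    garble("s"); // "猀" (U+7300)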

alexcat3 commented 2 months ago

I have managed to create a minimal (3 kB) example file that exhibits the behavior, by adding the CMap from the file provided by the user to the sample "hello world" PDF file: helloworld.pdf

alexcat3 commented 2 months ago

It turns out the trouble boils down to one line in the PDF font's ToUnicode CMap, which appears intended to map all characters in the range from 00 to 7F (all the ASCII characters) to the corresponding Unicode characters:

    <00> <7F> <00>

If you change this line to <00> <7F> <0000>, thus specifying the starting Unicode value with 2 bytes instead of one, the problem goes away.
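
For reference, that line sits inside the beginbfrange...endbfrange section of the font's ToUnicode CMap. A minimal sketch of such a CMap (standard ToUnicode boilerplate, abbreviated; only the bfrange line is taken from the file):

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <00> <FF>
    endcodespacerange
    1 beginbfrange
    % malformed: the destination should be two bytes, i.e. <0000>
    <00> <7F> <00>
    endbfrange
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
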
alexcat3 commented 2 months ago

I'm confused by the code that handles bfranges in CMaps. It seems to treat JavaScript strings as arrays of bytes, but I thought that since JavaScript uses UTF-16 they would be arrays of 16-bit words.

    mapBfRange(low, high, dstLow) {
      if (high - low > MAX_MAP_RANGE) {
        throw new Error("mapBfRange - ignoring data above MAX_MAP_RANGE.");
      }
      const lastByte = dstLow.length - 1;
      while (low <= high) {
        this._map[low++] = dstLow;
        // Only the last byte has to be incremented (in the normal case).
        const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
        if (nextCharCode > 0xff) {
          dstLow =
            dstLow.substring(0, lastByte - 1) +
            String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
            "\x00";
          continue;
        }
        dstLow =
          dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
      }
    }

alexcat3 commented 2 months ago

It appears that the above code is actually correct: it uses the 16-bit characters of a JS string to store the 8-bit bytes of the destination char code.

The problem seems to be in the readToUnicode function in evaluator.js, which uses cmap.js to parse the ToUnicode CMap as a regular CMap and then converts the result into a ToUnicode CMap. That conversion assumes the strings in the parsed CMap form valid UTF-16 once each pair of adjacent characters is merged into one code unit. However, if the PDF file omits the leading zeros on the UTF-16 value, the CMap ends up containing a string with an odd number of characters, whose first character is a UTF-16 low byte with no high byte to pair it with.
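
Here is a small sketch of that pairing step (illustrative only, not the actual pdf.js source), showing how an odd-length byte string reproduces the symptom: charCodeAt past the end of the string returns NaN, the bitwise OR coerces it to 0, and the lone byte lands in the high half of the code unit:

    // Merge pairs of 8-bit bytes (stored one per string character) into
    // UTF-16 code units, the way the ToUnicode conversion expects to.
    function pairBytes(token) {
      const units = [];
      for (let k = 0; k < token.length; k += 2) {
        // For an odd-length token, token.charCodeAt(k + 1) is NaN, which
        // `|` coerces to 0, so the lone byte becomes the HIGH byte.
        units.push((token.charCodeAt(k) << 8) | token.charCodeAt(k + 1));
      }
      return String.fromCharCode(...units);
    }

    pairBytes("\x00\x59"); // "Y" - the well-formed <0059> destination
    pairBytes("\x59");     // "夀" (U+5900) - the malformed <59> destination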

alexcat3 commented 2 months ago

It is unclear whether omitting leading zeros on hex-encoded UTF-16 values in a ToUnicode CMap is allowed by the PDF spec. However, seeing that there is at least one PDF in the wild that does it, and that other PDF readers can read it, pdf.js should probably handle it too. I will try to make a pull request with a fix. This will be my first ever pull request to an open source project.
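
One possible shape for such a fix (a sketch under my own assumptions, not necessarily what the pull request does) would be to detect odd-length destination strings when building the ToUnicode map and left-pad them with a zero byte so the byte pairing stays aligned:

    // Sketch only: left-pad odd-length destination strings with a zero
    // byte before pairing bytes into UTF-16 code units.
    function padDestination(token) {
      return token.length % 2 === 1 ? "\x00" + token : token;
    }

    // With the pairBytes sketch above:
    // pairBytes(padDestination("\x59")) === "Y"  (instead of "夀")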

myfonj commented 2 months ago

Nice, good luck! I have zero experience with PDF internals, but in the unlikely case it hasn't occurred to you: there may well be hints somewhere in the SumatraPDF codebase as to whether they do some "magical fixups" of badly encoded PDFs, and if so, how.

alexcat3 commented 2 months ago

Thank you! I submitted my pull request. https://github.com/mozilla/pdf.js/pull/18390