myfonj closed this issue 2 months ago
Hi @myfonj, Could you please confirm that you are facing this issue with all of the text in the PDF you provided here?
This bug is reproducible on both Mozilla Firefox and Microsoft Edge on Windows 11 with the latest code. All text in the document, including numbers, is copied as Chinese characters. Meanwhile, Microsoft Edge's built-in PDF viewer copies the text correctly.
Actually, my previous statement is incorrect: letters with accent marks are copied correctly in pdf.js. The other characters are replaced with CJK characters.
Looking at the garbled text in a hex editor, it appears that the problem is that ASCII characters were converted to UTF-16 with the wrong endianness.
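To illustrate the effect (a sketch under my own assumptions; `bytesToUtf16` is a hypothetical helper, not pdf.js code): if adjacent bytes of a byte-per-char string are merged into UTF-16 code units, a lone byte such as 0x59 ("Y") ends up as the *high* byte of a code unit and turns into a CJK character.

```javascript
// Hypothetical helper, NOT pdf.js code: merge adjacent bytes of a
// byte-per-char string into UTF-16 code units (big-endian pairing).
function bytesToUtf16(byteString) {
  let out = "";
  for (let i = 0; i < byteString.length; i += 2) {
    const hi = byteString.charCodeAt(i); // high byte
    const lo = byteString.charCodeAt(i + 1) | 0; // NaN -> 0 past the end
    out += String.fromCharCode((hi << 8) | lo);
  }
  return out;
}

// Two bytes <0059> decode to "Y" as intended:
console.log(bytesToUtf16("\x00\x59")); // "Y"

// A single byte <59> has no low byte to pair with, so 0x59 becomes the
// HIGH byte: U+5900 is "夀", the first garbled character in this issue.
console.log(bytesToUtf16("\x59")); // "夀"
```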
I have managed to create a minimal (3 KB) example file that exhibits the behavior, by adding the CMap from the file provided by the user to the sample "hello world" PDF file: helloworld.pdf
It turns out the trouble boils down to one line in the PDF font's ToUnicode CMap, which appears intended to map all characters in the range 00 to 7F (all the ASCII characters) to the corresponding Unicode characters:
<00> <7F> <00>

If you change this line to <00> <7F> <0000>, thus specifying the starting Unicode value with two bytes instead of one, the problem goes away.

I'm confused by the code that handles bfranges in CMaps. It seems to treat JavaScript strings as arrays of bytes, but I thought that, since JavaScript uses UTF-16, they would be arrays of 16-bit words:
```javascript
mapBfRange(low, high, dstLow) {
  if (high - low > MAX_MAP_RANGE) {
    throw new Error("mapBfRange - ignoring data above MAX_MAP_RANGE.");
  }
  const lastByte = dstLow.length - 1;
  while (low <= high) {
    this._map[low++] = dstLow;
    // Only the last byte has to be incremented (in the normal case).
    const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
    if (nextCharCode > 0xff) {
      dstLow =
        dstLow.substring(0, lastByte - 1) +
        String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
        "\x00";
      continue;
    }
    dstLow =
      dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
  }
}
```
It appears that the above code is actually correct: it uses the 16-bit characters of a JS string to store the 8-bit bytes of the destination char code. The real problem seems to be in the readToUnicode function in evaluator.js, which uses cmap.js to parse the ToUnicode CMap as a regular CMap and then converts the result into a ToUnicode map. It assumes that each destination string produced by the parse can be turned into valid UTF-16 by merging each pair of adjacent characters into one code unit. However, if the PDF file omits the leading zeros on the hex-encoded UTF-16 value, the parsed CMap ends up with a string with an odd number of characters, where the first character is a UTF-16 low byte with no high byte to pair it with.
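As a sanity check, here is the same loop run standalone (my simplification: the MAX_MAP_RANGE guard is dropped and the map is a plain object passed in), showing that it really does store one byte per JS character:

```javascript
// Standalone re-run of the mapBfRange loop quoted above, simplified:
// destination strings hold one BYTE per JS character, not one code unit.
function mapBfRange(map, low, high, dstLow) {
  const lastByte = dstLow.length - 1;
  while (low <= high) {
    map[low++] = dstLow;
    const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
    if (nextCharCode > 0xff) {
      // Carry into the previous byte and reset the last byte to 0x00.
      dstLow =
        dstLow.substring(0, lastByte - 1) +
        String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
        "\x00";
      continue;
    }
    dstLow =
      dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
  }
}

// Equivalent of the bfrange line "<41> <43> <0041>":
const map = {};
mapBfRange(map, 0x41, 0x43, "\x00\x41");
// map[0x41] === "\x00\x41", map[0x42] === "\x00\x42", map[0x43] === "\x00\x43"
```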
It is unclear whether omitting leading zeros on hex-encoded UTF-16 values in the ToUnicode CMap is allowed by the PDF spec. However, given that there is at least one PDF in the wild that does it and that other PDF readers handle it, pdf.js should probably accommodate it. I will try to make a pull request with a fix. This will be my first ever pull request to an open source project.
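The fix I have in mind (just a sketch of the idea, not the actual patch) is to pad odd-length destination byte strings with a leading zero byte, so the byte pairs line up on UTF-16 code-unit boundaries again:

```javascript
// Sketch of the repair step, not the actual pdf.js patch: prepend a
// zero byte when a destination byte string has an odd length.
function padToEven(byteString) {
  return byteString.length % 2 === 1 ? "\x00" + byteString : byteString;
}

console.log(padToEven("\x59") === "\x00\x59"); // true: <59> becomes <0059>
console.log(padToEven("\x00\x59") === "\x00\x59"); // true: already even, unchanged
```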
Nice, good luck! I have zero experience with PDF internals, but in the unlikely case it hasn't occurred to you: there may well be hints somewhere in the SumatraPDF codebase about whether they do some "magical fixups" of badly encoded PDFs, and if so, how.
Thank you! I submitted my pull request. https://github.com/mozilla/pdf.js/pull/18390
Link to PDF file:
https://web.archive.org/web/20240515102919/https://www.oahovorcovicka.cz/files/soubory/WEB_2023/Vsledky_CR_2024.pdf
Configuration:
Steps to reproduce the problem:
What is the expected behaviour? (add screenshot)
Clipboard should read
Yýsledková listina přijímacích zkoušek
, as it does in SumatraPDF or Acrobat Reader. (This is almost-correct OCR of the scan, consisting of Latin characters.)
What went wrong? (add screenshot)
Clipboard reads
夀ý猀氀攀搀欀漀瘀á 氀椀猀琀椀渀愀 瀀ř椀樀í洀愀挀í挀栀稀欀漀甀š攀欀
(This is a weird sequence of CJK characters, with a few Latin glyphs, all with diacritics.)
I see this PDF is really sloppy and there are many OCR errors throughout the document, but I guess that is not relevant.