Open labsnoir opened 5 years ago
If the Unicode conversion in the readToUnicode() method in evaluator.js is commented out (the lines beginning with // Convert "UTF-16BE"), the document is parsed correctly, although it is still very slow.
I don't know the internals of PDF, but could something be wrong with the metadata in that specific PDF file, for example the encoding information?
All in all, it would help in my case if readToUnicode() in evaluator.js could be disabled, for example via a parameter in PdfParser.
Did you get around to fixing this issue?
Parsing one PDF I am trying to read takes a very long time, and the output looks like East Asian characters although it should be German text. The length of the "toUnicode" array in fonts.js is 4294967293, and most of its elements are undefined. Traversing this array in buildToFontChar() takes several minutes. Other PDFs are parsed immediately without problems. Unfortunately, I cannot provide the document because it contains private information. If you need further information or if there is something I can check, please tell me.
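To illustrate why such an array is slow to traverse, here is a minimal sketch. It is not the actual code from fonts.js or evaluator.js: the array contents and the `definedEntries` helper are made up for illustration. It only assumes that a single write at a very large index is what inflated the length, and shows that visiting only the indices that actually exist avoids walking billions of empty slots.

```javascript
// Hypothetical stand-in for the "toUnicode" array described above; not pdf.js code.
const toUnicode = [];
toUnicode[0x41] = "A";
toUnicode[0xe4] = "ä";
// One write at a huge index is enough to inflate the array length.
toUnicode[4294967292] = "?";

console.log(toUnicode.length); // 4294967293

// Collect only the indices that are actually set, instead of
// looping from 0 to length - 1 over mostly empty slots.
function definedEntries(sparse) {
  const entries = [];
  for (const key of Object.keys(sparse)) {
    entries.push([Number(key), sparse[key]]);
  }
  return entries;
}

const entries = definedEntries(toUnicode);
console.log(entries.length); // 3
```

A plain `for (let i = 0; i < toUnicode.length; i++)` loop over the same array would run about 4.29 billion iterations, which would be consistent with the delay of several minutes I am seeing, although I have not confirmed that this is what buildToFontChar() does.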
Some more information:
Nevertheless: thank you for this great project!
Edit: It seems the problem is somewhere in readToUnicode() in evaluator.js. The large size of "toUnicode" comes from the German characters "ä", "ö", "ü" and "ß".