modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

pdf2json hangs in fonts.js in Font_buildToFontChar, #184

Open labsnoir opened 5 years ago

labsnoir commented 5 years ago

Parsing one of my PDFs takes a very long time, and the output looks like East Asian characters although it should be German text. The length of the array "toUnicode" in fonts.js is 4294967293, and most of its elements are undefined. Traversing this array in buildToFontChar() takes several minutes. Other PDFs are parsed immediately without problems. Unfortunately, I cannot provide the document because it contains private information. If you need further information, or if I can check something, please tell me.
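To illustrate why such an array is slow to traverse, here is a minimal sketch. It does not reproduce the code in fonts.js: the table contents, the scaled-down index, and the function names `countByIndex` and `countDefinedKeys` are assumptions made for the demo. It only shows that an index-based loop over a sparse array visits every slot up to `length`, while iterating over the keys that actually exist visits just the defined entries.

```javascript
// Hypothetical stand-in for the "toUnicode" table: a sparse array with a
// few real entries and one very large index. The large index is scaled
// down from the reported length (4294967293) so the demo finishes quickly.
const toUnicode = [];
toUnicode[0x41] = 'A';
toUnicode[0xe4] = 'ä';
toUnicode[5000000] = 'ß';

// Index-based traversal touches every slot, including all the holes.
function countByIndex(table) {
  let visited = 0;
  let defined = 0;
  for (let i = 0; i < table.length; i++) {
    visited++;
    if (table[i] !== undefined) defined++;
  }
  return { visited, defined };
}

// Object.keys lists only the indices that actually exist on the array,
// so the work is proportional to the number of real entries.
function countDefinedKeys(table) {
  let visited = 0;
  for (const key of Object.keys(table)) {
    if (table[key] !== undefined) visited++;
  }
  return { visited };
}

console.log(countByIndex(toUnicode));
console.log(countDefinedKeys(toUnicode));
```

With the real length of roughly 4.29 billion slots, the first style of loop would account for a delay of minutes, even though only a handful of entries carry data.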

Some more information:

Nevertheless: thank you for this great project!

Edit: The problem seems to be somewhere in readToUnicode() in evaluator.js. The large size of "toUnicode" comes from the German characters "ä", "ö", "ü" and "ß".

labsnoir commented 5 years ago

If the Unicode conversion in readToUnicode() in evaluator.js is commented out (the lines beginning with // Convert "UTF-16BE"), the document is parsed correctly, although parsing is still very slow.

I don't know the internals of PDF, but could something be wrong with the metadata in that specific PDF file, for example the encoding information?
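One possible explanation, not confirmed anywhere in this thread, is that the font's mapping strings are single-byte text that gets decoded as if it were UTF-16BE. The sketch below is not the code from evaluator.js. `decodeUtf16BE` is a hypothetical helper that ignores surrogate pairs. It only shows how pairing up bytes turns ordinary Latin text into code points in the CJK range, which would match the East Asian looking output.

```javascript
// Decode a byte string as UTF-16BE: each pair of bytes forms one
// 16-bit code unit. Illustration only; surrogate pairs are not handled.
function decodeUtf16BE(byteString) {
  let out = '';
  for (let i = 0; i + 1 < byteString.length; i += 2) {
    const unit = (byteString.charCodeAt(i) << 8) | byteString.charCodeAt(i + 1);
    out += String.fromCharCode(unit);
  }
  return out;
}

// Genuine UTF-16BE bytes for "ä" (U+00E4) decode to a single "ä".
const genuine = decodeUtf16BE('\x00\xe4');

// Single-byte text "ab" (0x61 0x62) misread as UTF-16BE becomes U+6162,
// which lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
const misread = decodeUtf16BE('ab');

console.log(genuine, misread.charCodeAt(0).toString(16));
```

If this is what happens with the affected file, skipping the conversion would leave the single-byte values intact, which is consistent with the observation that commenting it out yields correct text.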

labsnoir commented 5 years ago

All in all, it would help in my case if readToUnicode() in evaluator.js could be disabled, for example via a parameter in PdfParser.

sanath1188 commented 3 years ago

Did you get around to fixing this issue?