modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Parser failing with TypeError: jsFuncName.split is not a function on specific PDF file #105

Open gpt-anuj opened 7 years ago

gpt-anuj commented 7 years ago

ATO_tax_form.pdf

I am using version 1.1.7. In most cases parsing works beautifully, but for one file it fails with the exception below. I have checked the file with other PDF parsing services: it is a valid fillable PDF, and many services are able to extract information from it. I am attaching the PDF for reference. Any input on the issue would help.

code snippet ::

    let PDFParser = require('pdf2json'); // pdf2json version is 1.1.7
    let pdfParser = new PDFParser();
    pdfParser.loadPDF('./testingfile.pdf');

Exception trace :::

    TypeError: jsFuncName.split is not a function
        at processFieldAttribute (/Users/test/pdf/node_modules/pdf2json/lib/pdfanno.js:115:33)
        at /Users/test/pdf/node_modules/pdf2json/lib/pdfanno.js:96:17
        at Object.Dict_forEach [as forEach] (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :4826:9)
        at setupFieldAttributes (/Users/test/pdf/node_modules/pdf2json/lib/pdfanno.js:94:14)
        at Function.cls.processAnnotation (/Users/test/pdf/node_modules/pdf2json/lib/pdfanno.js:192:13)
        at TextWidgetAnnotation.WidgetAnnotation [as constructor] (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :3827:15)
        at new TextWidgetAnnotation (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :3847:22)
        at Function.Annotation_fromRef [as fromRef] (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :3724:22)
        at Object.annotations (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :4424:37)
        at LocalPdfManager_ensure [as ensure] (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :32503:22)
        at Object.Page_getOperatorList [as getOperatorList] (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :4352:43)
        at Object.eval [as onResolve] (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :27397:14)
        at Object.runHandlers (eval at (/Users/test/pdf/node_modules/pdf2json/lib/pdf.js:64:6), :864:35)
        at Timer.listOnTimeout (timers.js:92:15)
    [14/01/2017 13:14:33.236 GMT+1100] [ERROR] parsePage error:An error occurred while rendering the page 1: jsFuncName.split is not a function: TypeError: jsFuncName.split is not a function
    [14/01/2017 13:14:33.236 GMT+1100] [ERROR] Pdf Parse error ::parsePage error:An error occurred while rendering the page 1: jsFuncName.split is not a function: TypeError: jsFuncName.split is not a function
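
The top frame of the trace says that `.split` was called on a `jsFuncName` value that is not a string. The sketch below only illustrates that failure mode and one possible guard against it. It is not taken from pdf2json's source: `splitFuncName`, `unguardedSplit`, the `'('` separator, and the sample values are all hypothetical and chosen for illustration.

```javascript
// Hypothetical helpers; NOT pdf2json's actual code in pdfanno.js.

// Unguarded version: assumes the value is always a string.
function unguardedSplit(jsFuncName) {
  return jsFuncName.split('(');
}

// Guarded version: only strings have .split, so anything else
// (object, number, undefined) is skipped instead of throwing.
function splitFuncName(jsFuncName) {
  if (typeof jsFuncName !== 'string') {
    return null;
  }
  return jsFuncName.split('(');
}

// Passing an object where a string is expected throws a TypeError,
// the same error class as in the trace above.
let caught = null;
try {
  unguardedSplit({ name: 'someAction' }); // illustrative non-string value
} catch (err) {
  caught = err;
}
console.log(caught instanceof TypeError); // true

// The guarded helper returns null for the same input and still
// splits ordinary strings.
console.log(splitFuncName({ name: 'someAction' })); // null
console.log(splitFuncName('someAction("arg")')[0]); // someAction
```

Whether skipping such a field is the right behaviour for the attached form is a separate question. The guard only shows how a non-string value could be tolerated rather than aborting the page.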

mnutsch commented 6 years ago

Is there any update on this? This seems to occur when processing a PDF which contains a 3D object embedded in it.

flixcheck commented 5 years ago

I have the same issue with a huge PDF file. Any news?

cmmcneill commented 3 years ago

The attached PDF appears to parse correctly without errors for me, using the latest version in master (1.2.3). If anyone else is able to get this to happen, you can post the PDF here and I'll see if I can get a PR up to fix it.