opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
195 stars 39 forks

PDF Error #20

Closed: ricktjwong closed this issue 5 years ago

ricktjwong commented 5 years ago

This bug throws two different error messages (for different PDFs)

More information: The specific document in question is Annex VII Schedule of Specific Comm.pdf. The order of processing seems to matter: the error is not thrown when the file is the only one in the folder, or when only certain combinations of files are present in the folder.

Findings: The error is thrown in the following lines in pdf.jsx:

LoneRifle commented 5 years ago

The problem stems from pdfjs itself: it assumes serial, non-concurrent access and keeps a single global state, so it does not guard against PDFs being processed concurrently from, e.g., multiple Promises. pdf2md-cli therefore mangles this state by invoking pdfjs multiple times concurrently through the following lines: https://github.com/opendocsg/pdf2md/blob/3f5d71ec82eea7f6048f239aa6da52018963f8b6/lib/pdf2md-cli.js#L80-L84

Note the use of forEach without an await on the pdf2md call.
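
To illustrate the failure mode, here is a minimal sketch that does not use pdf2md or pdfjs at all. `fakeConvert` is a hypothetical stand-in that keeps one unguarded shared flag, which is only an assumption about how shared state might behave; `convertAllConcurrently` and `convertAllSerially` are likewise made-up names. The first mirrors the forEach-without-await pattern, the second awaits each conversion before starting the next.

```javascript
// Hypothetical stand-in for a converter with a single shared, unguarded state.
// This is NOT the pdf2md or pdfjs API; it only mimics the failure mode.
let busy = false

async function fakeConvert (name) {
  if (busy) throw new Error(`concurrent access while converting ${name}`)
  busy = true
  try {
    // Simulated asynchronous parsing work
    await new Promise(resolve => setTimeout(resolve, 5))
    return `# ${name}`
  } finally {
    busy = false
  }
}

// forEach does not wait for the promises it creates,
// so every conversion starts before the first one has finished.
async function convertAllConcurrently (names) {
  const pending = []
  names.forEach(name => { pending.push(fakeConvert(name)) })
  return Promise.allSettled(pending)
}

// for...of with await finishes each conversion before starting the next.
async function convertAllSerially (names) {
  const results = []
  for (const name of names) {
    results.push(await fakeConvert(name))
  }
  return results
}
```

In this sketch only the first concurrent conversion succeeds, while the serial loop converts every file. Whether a serial loop is the right fix for pdf2md-cli itself depends on the actual code at the linked lines.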