Closed ricktjwong closed 5 years ago
The problem stems from pdfjs itself: it assumes serial, non-concurrent access to its state, so it keeps a single global state and does not guard against several PDFs being processed at once, e.g. from multiple Promises. pdf2md-cli therefore mangles this state by invoking pdfjs multiple times concurrently through the following: https://github.com/opendocsg/pdf2md/blob/3f5d71ec82eea7f6048f239aa6da52018963f8b6/lib/pdf2md-cli.js#L80-L84
Note the use of `forEach` but the lack of `await` on `pdf2md`: each conversion is started without waiting for the previous one to finish.
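To illustrate why this matters, here is a minimal sketch of the pattern rather than the actual pdf2md-cli code. `fakePdf2md`, `convertWithForEach`, `convertSequentially` and the counters are hypothetical names invented for this example; `fakePdf2md` only simulates an async conversion that touches shared module-level state, which is how pdfjs is described above.

```javascript
// Hypothetical stand-in for pdf2md. It touches shared module-level state,
// which is the behaviour this issue attributes to pdfjs.
let activeJobs = 0;    // conversions currently in flight
let maxConcurrent = 0; // highest overlap observed so far

function resetCounters() {
  activeJobs = 0;
  maxConcurrent = 0;
}

async function fakePdf2md(name) {
  activeJobs += 1;
  maxConcurrent = Math.max(maxConcurrent, activeJobs);
  // Simulate asynchronous parsing work.
  await new Promise((resolve) => setTimeout(resolve, 5));
  activeJobs -= 1;
  return `# ${name}`;
}

// Problematic pattern: forEach does not wait for the promise produced in
// its callback, so every conversion starts before any of them finishes.
function convertWithForEach(files) {
  const pending = [];
  files.forEach((file) => {
    pending.push(fakePdf2md(file));
  });
  // Collected only so the caller can tell when everything has settled.
  return Promise.all(pending);
}

// Serial alternative: for...of with await starts the next conversion only
// after the previous one has completed.
async function convertSequentially(files) {
  const results = [];
  for (const file of files) {
    results.push(await fakePdf2md(file));
  }
  return results;
}
```

Calling `convertWithForEach` on several files lets the simulated conversions overlap, while `convertSequentially` keeps at most one in flight. Whether running serially is enough to avoid the errors with the real pdfjs is an assumption that would need to be checked against the failing PDFs.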
This bug throws two different error messages (for different PDFs)
More information: The specific document in question is `Annex VII Schedule of Specific Comm.pdf`. The order of processing seems to matter, because the error is not thrown if the file is the only one in the folder, or if certain combinations of multiple files are present in the folder.
Findings: The error is thrown in the following lines in pdf.jsx:
const metadata = await pdfDocument.getMetadata()
const page = await pdfDocument.getPage(j)