modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

PDF downloaded through request unreadable. From file it is readable. #163

Closed radboudp closed 3 months ago

radboudp commented 6 years ago

I intend to use pdf2json to test my PDF generator service within cucumberjs. When I read the expected PDF from file, I can parse it without problems. When I obtain the generated PDF from the service, it cannot be parsed. After some investigation I found the problem. The Buffer returned for the file has exactly as many bytes allocated as there are bytes in the PDF. The Buffer created by the request lib to download the PDF from the service is backed by an allocation larger than the number of bytes put into it. This seems to be a problem for pdf2json or the underlying PDF parser:

    { parserError: 'An error occurred while parsing the PDF: bad XRef entry' }).

For file pdf:

    pdfBuffer.buffer:  ArrayBuffer { byteLength: 1004 }
    pdfBuffer.length:  1004

For downloaded pdf:

    pdfBuffer.buffer:  ArrayBuffer { byteLength: 8192 }
    pdfBuffer.length:  1004
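
The two dumps differ only in the size of the backing ArrayBuffer. Here is a minimal Node.js sketch of how that can happen. It assumes the downloaded data ended up in a Buffer taken from Node's internal pool. The 8192 figure matches the default `Buffer.poolSize`, but the thread does not confirm which allocation call the request lib uses. `describe` is a hypothetical helper, not part of pdf2json.

```javascript
// Hypothetical helper: report how a Buffer relates to its backing ArrayBuffer.
function describe(buf) {
  return {
    length: buf.length,
    backingByteLength: buf.buffer.byteLength,
    byteOffset: buf.byteOffset,
  };
}

// Buffer.alloc() never uses the internal pool, so the backing store
// has exactly the requested size.
const exact = Buffer.alloc(1004);

// Buffer.allocUnsafe() may hand out a slice of a shared pool for small sizes,
// so the backing ArrayBuffer can be larger than the Buffer itself.
// (Stand-in for the downloaded PDF; an assumption, not the request lib's code.)
const pooled = Buffer.allocUnsafe(1004);

console.log('exact: ', describe(exact));
console.log('pooled:', describe(pooled));
```

Any code that reads `buf.buffer` directly, without honoring `byteOffset` and `length`, sees the whole backing store rather than only the PDF bytes.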

I work around this problem by creating a new buffer of the correct length and copying the data into it. Then it works.

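    // Allocate a buffer whose backing ArrayBuffer is exactly pdfBuffer.length bytes,
    // then copy the PDF bytes into it.
    const bufferNew = Buffer.alloc(pdfBuffer.length);
    pdfBuffer.copy(bufferNew);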
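
The same workaround can be wrapped in a small reusable function. This is a sketch only: `toExactBuffer` is a hypothetical name, and the 1004-byte `downloaded` buffer merely stands in for the PDF fetched from the service. It copies only when the Buffer does not already span its whole backing ArrayBuffer.

```javascript
// Hypothetical helper (not part of pdf2json): return a Buffer whose backing
// ArrayBuffer contains exactly the Buffer's bytes and nothing else.
function toExactBuffer(buf) {
  const alreadyExact =
    buf.byteOffset === 0 && buf.buffer.byteLength === buf.length;
  if (alreadyExact) {
    return buf; // nothing to fix, avoid a needless copy
  }
  // Buffer.alloc() never draws from the shared pool, so the new backing
  // store is exactly buf.length bytes long.
  const exact = Buffer.alloc(buf.length);
  buf.copy(exact);
  return exact;
}

// Stand-in for the downloaded PDF: possibly a slice of a larger pool.
const downloaded = Buffer.allocUnsafe(1004).fill(0x20);
const fixed = toExactBuffer(downloaded);

console.log(fixed.length, fixed.buffer.byteLength, fixed.byteOffset);
```

`Buffer.alloc` is used here instead of the `Buffer.from(...)` variants because some of those may themselves allocate from the pool for small inputs, which would reintroduce the oversized backing store.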

It seems to me that the buffer is parsed too far...

nettad commented 5 years ago

I ran into the exact same issue. @radboudp thanks for the workaround.

jbdemonte commented 5 years ago

Got exactly the same bug: it works from a physical file but doesn't from a stream. @radboudp thanks for the workaround.

jonaskello commented 3 years ago

+1 thanks @radboudp