`Unable to start parsing PDF file` error for PDFs from certain producers

julianhille / MuhammaraJS

Muhammara: a Node module with C/C++ bindings to modify PDFs with JS for Node or Electron (based on, and a replacement for, galkahana/HummusJS)

`Unable to start parsing PDF file` error for PDFs from certain producers #211

Closed mhassan1 closed 1 year ago

mhassan1 commented 2 years ago

We have noticed that PDFs generated by the following producers cannot be parsed by muhammara:

StreamServe Communication Server 16.4.0 GA Build
macOS Version 11.2.3 (Build 20D91) Quartz PDFCo

In both cases, the PDFs load in other readers (e.g. Adobe Reader).

I do not want to post examples publicly, since they contain personal information, but I can send them privately.

To reproduce:

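const muhammara = require('muhammara')

// Reading the file directly fails:
muhammara.createReader('input.pdf') // TypeError: Unable to start parsing PDF file

// Appending its pages to a new document fails as well:
muhammara
  .createWriter('output.pdf')
  .appendPDFPagesFromPDF('input.pdf') // TypeError: unable to append page, make sure it's fine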

mhassan1 commented 2 years ago

@julianhille Do you have any suggestions for debugging these PDFs? If there's something unsupported about them, it would be useful to be able to pinpoint what it is. Thanks!

julianhille commented 2 years ago

Have a look at the file headers. Most of the time the file starts with null bytes or characters like tab and space. I'm currently sick, but you could send the files to me privately anyway. I'll handle them with care and have a look as soon as I've recovered.
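
One way to act on this advice is to dump whatever precedes the `%PDF-` marker at the start of the file. The following is a minimal sketch using only Node's `fs` module. The helper names `describeLeadingBytes` and `readHead` and the 1024-byte window are assumptions made for illustration; they are not part of muhammara.

```javascript
const fs = require('fs')

// Hypothetical helper: report where "%PDF-" starts within the first
// `windowSize` bytes and show the bytes in front of it as hex.
function describeLeadingBytes (buffer, windowSize = 1024) {
  const head = buffer.subarray(0, windowSize)
  const offset = head.indexOf('%PDF-', 0, 'latin1')
  return {
    headerOffset: offset, // -1 if the marker is not inside the window
    // Bytes before the header; the whole window if no header was found
    prefixHex: head.subarray(0, offset === -1 ? head.length : offset).toString('hex')
  }
}

// Hypothetical helper: read only the first bytes of a file instead of
// loading the whole document into memory.
function readHead (path, windowSize = 1024) {
  const fd = fs.openSync(path, 'r')
  try {
    const buf = Buffer.alloc(windowSize)
    const bytesRead = fs.readSync(fd, buf, 0, windowSize, 0)
    return buf.subarray(0, bytesRead)
  } finally {
    fs.closeSync(fd)
  }
}
```

Calling `describeLeadingBytes(readHead('input.pdf'))` on a file that starts directly with the version header should report `headerOffset` 0 and an empty `prefixHex`. Anything else points at extra bytes in front of the header.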

mhassan1 commented 2 years ago

I see that the top part of the PDF looks like this:

--F2B80151991097585D127FA
Content-Length: 884438
Content-Type: application/pdf

%PDF-1.4

That looks to me like a multipart header. Have you ever seen that before? Is that a valid opening to a PDF?

julianhille commented 2 years ago

The multipart header does not belong there. The file normally starts with the PDF version header, like the last line of your excerpt.

mhassan1 commented 2 years ago

Thank you, that answers the question.

I noticed that other PDF parsing libraries are more lenient about unexpected text before the header:

pdf.js skips all text before the header: https://github.com/mozilla/pdf.js/blob/06599f487fc2e939fec4a6fd9e4b543883c7eba7/src/core/document.js#L887
pdf-lib skips all text before the header: https://github.com/Hopding/pdf-lib/blob/93dd36e85aa659a3bca09867d2d8fac172501fbe/src/core/parser/PDFParser.ts#L100
Adobe Reader doesn't seem to mind

I also see other issues in the past related to this, e.g. https://github.com/galkahana/HummusJS/issues/179.

According to the PDF 1.4 spec, the header line should be the first line in the file, although the appendix has a note:

Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.

I may open a pull request that makes the header line logic more lenient.
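
The lenient rule quoted from the appendix can be sketched as a search for `%PDF-` that is limited to the first 1024 bytes. This is an illustration only: `findPdfHeaderOffset` is a hypothetical name, and the sketch does not claim to match how pdf.js, pdf-lib, or any later muhammara release implements the check.

```javascript
// Hypothetical helper, not part of muhammara: return the byte offset of
// the "%PDF-" marker, or -1 if it is missing.
// Assumption: the marker must lie entirely within the first `limit` bytes.
function findPdfHeaderOffset (buffer, limit = 1024) {
  return buffer.subarray(0, limit).indexOf('%PDF-', 0, 'latin1')
}

// The multipart excerpt from this thread; "\n" line endings are an
// assumption, since the real file's line endings were not shown.
const sample = Buffer.from(
  '--F2B80151991097585D127FA\n' +
  'Content-Length: 884438\n' +
  'Content-Type: application/pdf\n' +
  '\n' +
  '%PDF-1.4\n',
  'latin1'
)

const offset = findPdfHeaderOffset(sample)
// The text from `offset` onwards begins with the version header.
console.log(offset, sample.subarray(offset, offset + 8).toString('latin1'))
```

A strict parser would require the returned offset to be 0, while a lenient one would accept any non-negative value.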

julianhille commented 2 years ago

Thank you for digging that up. PR welcome. I'm not sure whether this should be an option like "strict" or similar, and whether it should be on or off by default. At the very least, it changes behaviour.

Also, when copying a PDF, should the text before the version header be included in the copy? This is just a PDF library, so I'm not sure it should do that.
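
To make the two open questions concrete, here is a minimal sketch of what such a switch could look like when applied to a buffer before parsing. The function name `normalizePdfBuffer` and the `strict` and `limit` options are assumptions for illustration, not muhammara's API. Defaulting `strict` to `true` keeps the current behaviour, which is only one possible answer to the default question.

```javascript
// Hypothetical sketch of a "strict" switch; names and defaults are
// assumptions, not something decided in this thread.
function normalizePdfBuffer (buffer, { strict = true, limit = 1024 } = {}) {
  const offset = buffer.subarray(0, limit).indexOf('%PDF-', 0, 'latin1')
  if (offset === -1) {
    throw new TypeError('No %PDF- header within the first ' + limit + ' bytes')
  }
  if (strict && offset !== 0) {
    throw new TypeError('Unexpected ' + offset + ' byte(s) before the %PDF- header')
  }
  // Lenient mode: drop everything before the header, so the prefix is
  // not carried over into any copy made from the returned buffer.
  return buffer.subarray(offset)
}
```

In this sketch, lenient mode answers the copying question by discarding the prefix. As a possible workaround on the caller's side, the returned buffer could be handed to muhammara through `new muhammara.PDFRStreamForBuffer(buffer)`, although that has not been verified against the affected files.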

mhassan1 commented 1 year ago

Thank you, @julianhille. When can we expect a new release?

julianhille commented 1 year ago

Shortly. I want to check whether there are new Electron and/or Node versions and may include them. I planned to do that this evening, so I guess tomorrow or the day after. Is that fine for you?

mhassan1 commented 1 year ago

Fine for me!

julianhille commented 1 year ago

Release 3.2.0 has been published