Closed mhassan1 closed 1 year ago
@julianhille Do you have any suggestions for debugging these PDFs? If there's something unsupported about them, it would be useful to be able to pinpoint what it is. Thanks!
Have a look at the file headers. Most of the time the file starts with null bytes or characters like tab and space. I'm currently sick, but you could send the files to me privately anyway. I'll handle them with care and have a look as soon as I've recovered.
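One way to check for this is to dump whatever bytes sit in front of the `%PDF-` marker. Below is a minimal sketch in Python, written as a standalone script and not part of the library. The helper names `describe_prefix` and `describe_file` are hypothetical.

```python
def describe_prefix(data: bytes, marker: bytes = b"%PDF-"):
    """Return (offset, prefix).

    offset is the position of the first marker occurrence and prefix is
    everything before it. If the marker is missing, return -1 and the
    first 64 bytes so they can still be inspected.
    """
    offset = data.find(marker)
    if offset < 0:
        return -1, data[:64]
    return offset, data[:offset]


def describe_file(path: str, limit: int = 4096):
    """Read only the start of the file and describe what precedes the header."""
    with open(path, "rb") as fh:
        return describe_prefix(fh.read(limit))
```

For a well-formed file, `describe_file(path)` should report offset 0 and an empty prefix. Any non-zero offset means there are stray bytes (nulls, whitespace, or something else) before the header.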
I see that the top part of the PDF looks like this:
--F2B80151991097585D127FA
Content-Length: 884438
Content-Type: application/pdf
%PDF-1.4
That looks to me like a multipart header. Have you ever seen that before? Is that a valid opening to a PDF?
The multipart header does not belong there. A PDF file normally starts with the PDF version header, like the last line of your excerpt.
Thank you, that answers the question.
I noticed that other PDF parsing libraries are more lenient about unexpected text before the header:
pdf.js skips all text before the header: https://github.com/mozilla/pdf.js/blob/06599f487fc2e939fec4a6fd9e4b543883c7eba7/src/core/document.js#L887
pdf-lib skips all text before the header: https://github.com/Hopding/pdf-lib/blob/93dd36e85aa659a3bca09867d2d8fac172501fbe/src/core/parser/PDFParser.ts#L100
I also see other issues in the past related to this, e.g. https://github.com/galkahana/HummusJS/issues/179.
According to the PDF 1.4 spec, the header line should be the first line in the file, although the appendix has a note:
Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.
I may open a pull request that makes the header line logic more lenient.
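As an illustration of what "more lenient" could mean, here is a rough sketch in Python that looks for the `%PDF-` marker within the first 1024 bytes and drops anything in front of it. It is not the library's code and not the actual pull request. The name `strip_preamble` and the constants are assumptions made for the example.

```python
HEADER_MARKER = b"%PDF-"
SEARCH_WINDOW = 1024  # window size taken from the Acrobat note in the spec appendix


def strip_preamble(data: bytes, window: int = SEARCH_WINDOW) -> bytes:
    """Return the data starting at the PDF header.

    The marker must lie entirely within the first `window` bytes;
    otherwise a ValueError is raised.
    """
    offset = data.find(HEADER_MARKER, 0, window)
    if offset < 0:
        raise ValueError("no PDF header found in the first %d bytes" % window)
    return data[offset:]
```

Applied to the multipart-wrapped file described earlier, this would return bytes beginning with `%PDF-1.4`. Dropping the leading bytes shifts every file offset, so whether the cross-reference offsets still line up depends on how the extra bytes got there. That is not checked here.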
Thank you for digging that up. A PR is welcome. I'm not sure whether this should be an option like "strict" or similar, and whether it should be on or off by default. Either way, it changes behaviour.
Also, when copying a PDF, should the text before the version header be included in the copy? This is just a PDF library, so I'm not sure it should do that.
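To make the trade-off concrete, here is a hypothetical sketch in Python of how a `strict` switch might behave. The option name, its default, and the function `header_offset` are assumptions for illustration only and do not reflect the library's API.

```python
def header_offset(data: bytes, strict: bool = True, window: int = 1024) -> int:
    """Return the byte offset of the PDF header.

    strict=True:  the header must be the very first bytes of the file.
    strict=False: the header may start anywhere in the first `window` bytes.
    """
    marker = b"%PDF-"
    if strict:
        if not data.startswith(marker):
            raise ValueError("PDF header must be at byte 0 in strict mode")
        return 0
    offset = data.find(marker, 0, window)
    if offset < 0:
        raise ValueError("no PDF header found in the first %d bytes" % window)
    return offset
```

Returning the offset instead of stripping the bytes leaves the copying question open: a caller could skip the leading text when parsing and still decide separately whether to carry it over into a copied file.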
Thank you, @julianhille. When can we expect a new release?
Shortly. I wanted to check whether there are new Electron and/or Node versions that I could include. I planned to do that this evening, so I guess tomorrow or the day after. Is that fine for you?
Fine for me!
Release 3.2.0 has been published
We have noticed that PDFs generated by the following producers cannot be understood by muhammara:
StreamServe Communication Server 16.4.0 GA Build
macOS Version 11.2.3 (Build 20D91) Quartz PDFCo
In both cases, the PDFs load in other readers (e.g. Adobe Reader).
I do not want to post examples publicly, since they contain personal information, but I can send them privately.
To reproduce: