pdfcpu / pdfcpu

A PDF processor written in Go.
http://pdfcpu.io/
Apache License 2.0

PDF/A conversion/validation #34

Open hhrutter opened 6 years ago

hhrutter commented 6 years ago

As requested by @tcurdt

It would be great if pdfcpu could be used to convert to PDF/A

https://en.wikipedia.org/wiki/PDF/A

magegu commented 4 years ago

hi there,

I'm interested in this feature, too - and I would like to add "merging" to the PDF/A "wishlist" tracked in this issue.

As a first test I merged two valid PDF/A files (pdfcpu: v0.3.4 dev) and ran this validator on the result. The feedback was:

Validating file "out.pdf" for conformance level pdfa-2u
The separator before 'endstream' must be an EOL.
The separator before 'endstream' must be an EOL.
The separator before 'endstream' must be an EOL.
The separator before 'endstream' must be an EOL.
The document does not conform to the requested standard.
The file format (header, trailer, objects, xref, streams) is corrupted.
The document does not conform to the PDF/A-2u standard.

I would like to contribute, but I don't know enough about the PDF spec to add valuable info at this point. Is there anything specific you can point me to as a starting point for debugging these issues, @hhrutter?
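The 'endstream' finding reported by the validator above can be narrowed down with a plain byte scan. The following is a minimal sketch, not pdfcpu code: the function name `endstreamOffsetsWithoutEOL` is hypothetical, and the scan is deliberately naive, so it may also match the bytes "endstream" inside stream data. Treat its output as a hint about where to look, not as a validation result.

```go
package main

import (
	"bytes"
	"fmt"
)

// endstreamOffsetsWithoutEOL returns the byte offset of every "endstream"
// keyword that is not directly preceded by CR or LF.
// Assumption: a raw byte scan is good enough for debugging; it does not
// parse the file, so matches inside stream data are possible.
func endstreamOffsetsWithoutEOL(pdf []byte) []int {
	keyword := []byte("endstream")
	var offsets []int
	for pos := 0; ; {
		i := bytes.Index(pdf[pos:], keyword)
		if i < 0 {
			return offsets
		}
		at := pos + i
		// The byte before the keyword has to be an end-of-line character.
		if at == 0 || (pdf[at-1] != '\n' && pdf[at-1] != '\r') {
			offsets = append(offsets, at)
		}
		pos = at + len(keyword)
	}
}

func main() {
	good := []byte("<< /Length 3 >>\nstream\nabc\nendstream\nendobj\n")
	bad := []byte("<< /Length 3 >>\nstream\nabcendstream\nendobj\n")
	fmt.Println(len(endstreamOffsetsWithoutEOL(good)), len(endstreamOffsetsWithoutEOL(bad)))
}
```

Running this over the merged out.pdf (read with `os.ReadFile`) would list the offsets of the streams that were written without a trailing EOL, which may help locate the writer code path responsible.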

hhrutter commented 4 years ago

As soon as pdfcpu supports PDF/A, merging should work out of the box. Basically, after reading the file you have to make a pass over the cross-reference table and check or enforce PDF/A compliance for every element; this may fail or succeed for any given PDF input file. You would need the PDF/A spec for that. I don't know whether it is freely available, because I have been too busy with other things to look into it.
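To make the idea of such a pass concrete, here is a rough sketch under stated assumptions. It does not use pdfcpu's types: `Object`, `Rule`, `checkAll` and both example rules are hypothetical, and the table is a simplified stand-in for a parsed cross-reference table. The two rules (catalog needs a Metadata entry, no JavaScript actions) are only illustrations; the authoritative rule set is whatever the PDF/A spec defines.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a hypothetical, simplified stand-in for a parsed PDF object.
type Object struct {
	Type string            // e.g. "Catalog", "Pages", "Action"
	Dict map[string]string // flattened dictionary entries
}

// Rule inspects one object and returns a non-empty message on violation.
type Rule func(o Object) string

// checkAll visits every table entry in object-number order, applies
// every rule to it and collects the violations.
func checkAll(table map[int]Object, rules []Rule) []string {
	nrs := make([]int, 0, len(table))
	for nr := range table {
		nrs = append(nrs, nr)
	}
	sort.Ints(nrs)
	var findings []string
	for _, nr := range nrs {
		for _, rule := range rules {
			if msg := rule(table[nr]); msg != "" {
				findings = append(findings, fmt.Sprintf("obj %d: %s", nr, msg))
			}
		}
	}
	return findings
}

// catalogHasMetadata: illustrative rule, the catalog needs a Metadata entry.
func catalogHasMetadata(o Object) string {
	if _, ok := o.Dict["Metadata"]; o.Type == "Catalog" && !ok {
		return "catalog has no Metadata entry"
	}
	return ""
}

// noJavaScriptAction: illustrative rule, JavaScript actions are rejected.
func noJavaScriptAction(o Object) string {
	if o.Type == "Action" && o.Dict["S"] == "JavaScript" {
		return "JavaScript action is not allowed"
	}
	return ""
}

func main() {
	table := map[int]Object{
		1: {Type: "Catalog", Dict: map[string]string{"Pages": "2 0 R"}},
		2: {Type: "Pages", Dict: map[string]string{"Count": "1"}},
		3: {Type: "Action", Dict: map[string]string{"S": "JavaScript"}},
	}
	for _, f := range checkAll(table, []Rule{catalogHasMetadata, noJavaScriptAction}) {
		fmt.Println(f)
	}
}
```

Keeping each requirement as a separate rule function means validation only reports findings, while a later conversion step could pair each rule with a fix that rewrites the offending object.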

OrganicChem commented 4 years ago

Hi there, any update on this issue?

I'm trying to run pdfcpu within a NodeJS environment, and if a file is invalid, I would like some guidance on how to convert the PDF to a valid one in-stream.

hhrutter commented 4 years ago

Hello!

This issue remains in the pipeline while I am busy working on other issues and improving pdfcpu's API. I am doing what I can with the given resources. There is always the Sponsor ❤️ button at the top of pdfcpu's GitHub landing page if you like my work and maybe feel like boosting the priority of a specific issue/feature.

I am assuming that when you say "invalid" you mean pdfcpu validation is complaining. Without knowing the details, and further assuming you are doing this in your Go backend, you need to establish a PDFContext in memory and then call Optimize on it. pdfcpu optimize is what comes closest to the repair facility found in other PDF tools.

As always, if you're stuck - even after optimization - I am happy to look into any representative sample PDF you can provide and see if we can improve optimization or relax validation.

If you do want to follow up on the specific issue you just raised, please open a separate issue, as I can't see any relation to upcoming PDF/A support in pdfcpu.

Thank you!

hhrutter commented 4 years ago

Thanks for sharing, I'll check them out!

hhrutter commented 4 years ago

I had to delete your problematic files because some of them include personal information and they are unrelated to this issue. Rest assured I am analysing them, but please refrain from posting files with sensitive information here. It's safer to submit such files to hhrutter@gmail.cm

Thank you!

OrganicChem commented 4 years ago

I've managed to get 3 out of the 4 bad files (file 2 remains troublesome) to merge using the following workaround:

Since I'm working mainly in a NodeJS environment, I basically split the problematic files using Hummus, a NodeJS module for Creating, Parsing and Modifying PDF files and streams. After splitting, 3 of the 4 bad files were merged successfully with pdfcpu.

The splitting process is quite fast, e.g. a 3K page PDF took less than 5 seconds! I just wanted to share this.

hhrutter commented 4 years ago

Sounds good!

In any case, I already have fixes for all four of your bad files. They will be part of the upcoming release.

OrganicChem commented 4 years ago

Any timeline on the next release?

hhrutter commented 4 years ago

This weekend.

OrganicChem commented 4 years ago

Thank you so much, Horst, for resolving those bad files during the merge process with the new version 0.3.5!

This is much appreciated, I really mean that!

ghost commented 3 years ago

Is there anything planned regarding PDF/A validation as well as conversion (to PDF/A) support in any upcoming release?

hhrutter commented 3 years ago

Once pdfcpu goes beta this will be tackled. Please keep checking #235.