Open rernens opened 4 years ago
Thanks for the reports. I went through the PDFs and made the Lexer more forgiving in certain places. They should work now, except for the following:
The PDF LONGY_JULIE_Complement.dossier_LONGY.Julie.pdf
has a syntax error, so there is no fix from my side, I am afraid.
% java -jar test/preflight-app-2.0.19.jar LONGY_JULIE_Complement.dossier_LONGY.Julie.pdf
The file LONGY_JULIE_Complement.dossier_LONGY.Julie.pdf is not a valid PDF/A-1b file, error(s) :
1.1 : Header Syntax error, Second line must begin with '%' followed by at least 4 bytes greater than 127
1.0 : Syntax error, XREF for 20:0 points to wrong object: 19:0
Same for LONGY_JULIE_DAMA_LONGY.Julie.pdf
% java -jar test/preflight-app-2.0.19.jar LONGY_JULIE_DAMA_LONGY.Julie.pdf
The file LONGY_JULIE_DAMA_LONGY.Julie.pdf is not a valid PDF/A-1b file, error(s) :
1.1 : Header Syntax error, Second line must begin with '%' followed by at least 4 bytes greater than 127
1.0 : Syntax error, XREF for 11:0 points to wrong object: 10:0
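For anyone wondering what the 1.1 header error means, here is a rough sketch of the rule as it is worded in the message above. It is only an approximation based on that wording, not preflight's actual implementation, and `hasBinaryHeaderComment` is a made-up helper name that is not part of pdfjs.

```javascript
// Sketch only: approximates the PDF/A-1b header rule quoted in the preflight output.
// hasBinaryHeaderComment is a hypothetical helper, not part of pdfjs or preflight.
const hasBinaryHeaderComment = (buffer) => {
  // Find the end of the first line (LF or CR).
  let i = 0;
  while (i < buffer.length && buffer[i] !== 0x0a && buffer[i] !== 0x0d) i++;
  // Skip a single end-of-line marker: CR, LF, or CR LF.
  if (buffer[i] === 0x0d) i++;
  if (buffer[i] === 0x0a) i++;
  // The second line must begin with '%' (0x25) ...
  if (buffer[i] !== 0x25) return false;
  // ... followed by at least 4 bytes greater than 127.
  for (let j = 1; j <= 4; j++) {
    if (!(buffer[i + j] > 127)) return false;
  }
  return true;
};
```

It can be run on the same buffer you would hand to ExternalDocument, e.g. `hasBinaryHeaderComment(fs.readFileSync(path))`.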
In case the PDFs contain personal data, feel free to delete them again.
@rkusa thanks for this fast response.
Before I test the code, could you check the attached PDF, which comes directly from Adobe and fails in the same way, to see whether it is the same problem or a different one?
@rernens This PDF fails with a different error. I've added it to my TODO to look into it. Regarding the syntax errors of the other PDFs, the following issue has the same problem https://github.com/rkusa/pdfjs/issues/211#issuecomment-632661935
If it ends up coming up very often, I might change my mind and look into it myself. However, until now, my deployment of pdfjs never encountered that syntax error before - but merging PDFs is also not my main use-case.
@rkusa
Version 2.3.8 fixes the issue with some documents but not all. But you know that. Thanks for your help so far. Fixing it globally would be a must for us.
@rkusa
Hi Markus, 2.3.8 fixes some issues. As a workaround at this stage, we save the PDFs that produce an error when parsed by the Lexer as separate documents.
But we are experiencing another case where a PDF passes the Lexer stage without error and gets merged, but the page is empty.
The first attached PDF is the result of merging two PDFs: a summary PDF and the customer-uploaded PDF. As you can see, the second page is white.
The second attached PDF is the one that was added as the second page of the merged PDF above. MOREAU_CINDY_000000022_201912_20200528161813.pdf 000001-actualisation decembre 2019.pdf
thanks for your help
@rkusa
Hi Markus. White pages are generated by protected PDFs! No parsing error is raised. Using qpdf to unprotect the PDF before merging it fixes the problem. But I am still facing numerous parsing errors that prevent pdfjs from loading the files as an ExternalDocument, even though Acrobat opens them without reporting any parsing error. I hope you can do something about that.
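In case it helps others, the qpdf step can be sketched from Node roughly like this. It assumes qpdf is installed and on the PATH, and that the PDF has no user password (otherwise qpdf also needs `--password=`). `qpdfDecryptArgs` and `unprotectPdf` are names made up for this example.

```javascript
const {execFile} = require('child_process');

// Builds the argument list for `qpdf --decrypt <input> <output>`.
// Hypothetical helper name, kept separate so it is easy to inspect.
const qpdfDecryptArgs = (inputPath, outputPath) => [
  '--decrypt',
  inputPath,
  outputPath,
];

// Hypothetical helper: runs qpdf (assumed to be installed and on the PATH)
// and resolves with the path of the unprotected copy.
const unprotectPdf = (inputPath, outputPath) =>
  new Promise((resolve, reject) => {
    execFile('qpdf', qpdfDecryptArgs(inputPath, outputPath), (error) => {
      if (error) {
        // Any non-zero exit code from qpdf ends up here.
        reject(error);
      } else {
        resolve(outputPath);
      }
    });
  });
```

The unprotected file can then be read into a buffer and passed to ExternalDocument as usual.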
I'll try to look into the additional errors once I find the time for it. Since merging existing PDFs was never intended to be the main use-case of pdfjs, please don't expect it to be as permissive to different PDF features and syntax variants as e.g. Acrobat Reader. If your main use-case is to merge PDFs, I have to honestly say that pdfjs might not be the best tool for the job 😕
@rkusa Hi Markus. Even if this was not the main use-case of pdfjs, so far it has proven to be the most lightweight and most reliable library for merging PDFs. I tried many libraries and yours remains unmatched so far, even if some parsing issues remain. Thanks for that.
Hello!
I don't know if it would help, but I needed to merge PDFs too, and this library saved me (so thanks a lot @rkusa, very nice job!).
Some of my PDF files are generated with puppeteer and others are stored in AWS. I work with "Buffers" only, and it seems that's what you need too, @rernens.
Here's my method:
const {Document, ExternalDocument} = require('pdfjs');

/**
 * Merge multiple PDF buffers into one buffer
 *
 * @param {Array<Buffer>} bufferList
 * @return {Promise<Buffer>}
 */
const mergeBufferPdfs = (bufferList) => {
  if (bufferList.length === 0) {
    throw new Error('You must pass buffers to merge a PDF');
  }

  const mergeDocument = new Document();

  // Parse each buffer and append all of its pages to the merged document
  bufferList.forEach((buffer) => {
    const externalDocument = new ExternalDocument(buffer);
    mergeDocument.addPagesOf(externalDocument);
  });

  return mergeDocument.asBuffer();
};
So far so good, I like when it's simple. I hope it can help somehow.
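A possible variation on the method above, for the case discussed earlier in the thread where some buffers fail to parse: instead of aborting the whole merge, the failing buffers are collected so they can be saved as separate documents. This is only a sketch. `mergeParsablePdfs` is a made-up name, and the pdfjs module is passed in as a parameter (an assumption made here so the function is easy to try with stubs).

```javascript
// Sketch only: merges every buffer that parses and reports the ones that do not.
// `pdfjs` is expected to be the module itself, e.g. require('pdfjs').
const mergeParsablePdfs = (pdfjs, bufferList) => {
  const mergeDocument = new pdfjs.Document();
  const failed = [];

  bufferList.forEach((buffer, index) => {
    try {
      // Parsing happens here; a syntax error in the PDF is thrown by this step.
      mergeDocument.addPagesOf(new pdfjs.ExternalDocument(buffer));
    } catch (error) {
      // Keep the buffer so the caller can store it as a separate document.
      failed.push({index, buffer, error});
    }
  });

  return mergeDocument.asBuffer().then((merged) => ({merged, failed}));
};
```

Usage would look like `mergeParsablePdfs(require('pdfjs'), buffers).then(({merged, failed}) => { ... })`. Note that this does not help with protected PDFs, since those merge without any error.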
Hi,
I am using your library to merge multiple PDFs into one after various documents are uploaded from a web application. While this works without problems most of the time, certain documents fail at the ExternalDocument stage without a specific error being returned. The process is the following:
The merge process extracts each PDF from the database blob, turns it back into a data URI, and converts it to a buffer. The buffer is passed to ExternalDocument so that pdfjs recognizes it as a PDF, and the result is added to the merged PDF.
This process works fine most of the time, but some PDFs won't pass the ExternalDocument stage.
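For reference, the data-URI-to-buffer conversion described above can be sketched like this. It assumes the data URI is base64-encoded, and `dataUriToBuffer` and `looksLikePdf` are helper names made up for this example. The second helper is just a quick sanity check that the conversion itself did not corrupt the file before it reaches ExternalDocument.

```javascript
// Hypothetical helper: decodes a base64 data URI into a Node.js Buffer.
const dataUriToBuffer = (dataUri) => {
  const marker = ';base64,';
  const index = dataUri.indexOf(marker);
  if (!dataUri.startsWith('data:') || index === -1) {
    throw new Error('Expected a base64-encoded data URI');
  }
  // Everything after ';base64,' is the payload.
  return Buffer.from(dataUri.slice(index + marker.length), 'base64');
};

// Hypothetical helper: checks that the buffer starts with the '%PDF-' signature.
const looksLikePdf = (buffer) =>
  buffer.subarray(0, 5).toString('latin1') === '%PDF-';
```

If `looksLikePdf` returns false for a document, the problem is in the storage or conversion step rather than in the pdfjs parser.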
ACTU 04-20.pdf CCF_000001.pdf LONGY_JULIE_Complément dossier_LONGY Julie.pdf LONGY_JULIE_DAMA_LONGY Julie.pdf LONGY_JULIE_detail dossier_longy julie.pdf
The above files are examples of PDFs that won't pass the ExternalDocument step.
I am using the latest version of pdfjs: v2.3.7
thanks for your help