Merging some pdfs results in ExternalDocument returning error

rernens commented 4 years ago

Hi,

I am using your library to merge multiple PDFs into one after uploading various documents from a web application. While this works without problems most of the time, certain documents fail at the ExternalDocument stage without a specific error being returned. The process is the following:

documents are uploaded as data URIs from a web application;
if documents are images, they are converted to PDF before being uploaded;
once uploaded, PDFs are stored as blobs (base64) in a database;
when all documents have been successfully uploaded, they are merged into a single PDF that is stored on disk.

The merge process extracts each PDF from the database blob, turns it back into a data URI, and converts it to a buffer. The buffer is passed to ExternalDocument, which turns it into a PDF that pdfjs recognizes, and the result is added to the merged PDF.
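The round trip through a data URI described above may not be needed when the stored blob is already base64. A minimal sketch, assuming the blob is either a raw base64 string or a full data URI; the helper name `blobToBuffer` is hypothetical and not part of pdfjs:

```javascript
// Hypothetical helper (not a pdfjs API): decode a stored blob into a Buffer
// using only Node's built-in Buffer.
// Assumption: the blob is a base64 string, optionally prefixed with a
// "data:<mime>;base64," header.
function blobToBuffer(blob) {
  const marker = ';base64,';
  const idx = blob.indexOf(marker);
  // Strip the data-URI header if present, otherwise decode the string as-is.
  const base64 = idx === -1 ? blob : blob.slice(idx + marker.length);
  return Buffer.from(base64, 'base64');
}
```

The resulting Buffer is the kind of input that is handed to ExternalDocument in the function below.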

async function _mergeFiles(files, fileName) {
  var pdf = requireNode('pdfjs');
  var fs = requireNode('fs');
  var toBuffer = requireNode('data-uri-to-buffer');

  try {
    var doc = new pdf.Document();

    for (var i = 0; i < files.length; i++) {
      var file = files[i];
      var src,
        ext;

      // Rebuild a data URI from the stored blob, then decode it to a Buffer.
      src = file.dataUri.toBuffer();
      var dataUri = 'data:application/pdf;base64,' + src.toString('base64');
      src = toBuffer(dataUri);

      // This is the step that fails for some PDFs.
      ext = new pdf.ExternalDocument(src);
      doc.setTemplate(ext);
      doc.addPagesOf(ext);
    }

    var writeStream = doc.pipe(fs.createWriteStream(fileName));

    // Register the listeners before ending the document so that the
    // 'close' event cannot be missed.
    var writeStreamClosedPromise = new Promise((resolve, reject) => {
      writeStream.on('close', () => resolve());
      writeStream.on('error', (e) => reject({file: file.name, sequence: file.sequence, reason: e}));
    });

    await doc.end();

    return writeStreamClosedPromise;
  } catch (e) {
    // `reject` is not defined in this scope; throwing rejects the promise
    // returned by the async function and keeps the original error in `reason`.
    throw {file: file.name, sequence: file.sequence, reason: e};
  }
}

This process works fine most of the time, but some PDFs won't pass the ExternalDocument stage.

ACTU 04-20.pdf CCF_000001.pdf LONGY_JULIE_Complément dossier_LONGY Julie.pdf LONGY_JULIE_DAMA_LONGY Julie.pdf LONGY_JULIE_detail dossier_longy julie.pdf

The files above are examples of PDFs that won't pass the ExternalDocument step.
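To find out which file fails and why, each buffer can be parsed on its own and the error recorded per file. This is only a sketch: the function name `parseEach` and the `name`/`buffer` fields are assumptions, and the parser is passed in as an argument so that the helper does not depend on pdfjs itself.

```javascript
// Sketch: parse every file separately so that one bad PDF does not hide
// which input failed. `parse` is injected by the caller; with pdfjs it would
// be something like (buf) => new pdf.ExternalDocument(buf).
// Assumption: each file is an object of the shape { name, buffer }.
function parseEach(files, parse) {
  const parsed = [];
  const failed = [];
  for (const file of files) {
    try {
      parsed.push({ name: file.name, doc: parse(file.buffer) });
    } catch (e) {
      // Keep the message so the failure can be logged or reported upstream.
      failed.push({ name: file.name, reason: e && e.message ? e.message : String(e) });
    }
  }
  return { parsed, failed };
}
```

The `parsed` entries can then be added to the merged document, while the `failed` list names the offending files along with the parser's message.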

I am using the latest version of pdfjs: v2.3.7.

Thanks for your help.

rkusa commented 4 years ago

Thanks for the reports. I went through the PDFs and made the Lexer more forgiving in certain places. They should work now, except for the following:

The PDF LONGY_JULIE_Complement.dossier_LONGY.Julie.pdf has a syntax error. So no fix here from my side, I am afraid.

% java -jar test/preflight-app-2.0.19.jar LONGY_JULIE_Complement.dossier_LONGY.Julie.pdf
The file LONGY_JULIE_Complement.dossier_LONGY.Julie.pdf is not a valid PDF/A-1b file, error(s) :
1.1 : Header Syntax error, Second line must begin with '%' followed by at least 4 bytes greater than 127
1.0 : Syntax error, XREF for 20:0 points to wrong object: 19:0

Same for LONGY_JULIE_DAMA_LONGY.Julie.pdf

% java -jar test/preflight-app-2.0.19.jar LONGY_JULIE_DAMA_LONGY.Julie.pdf
The file LONGY_JULIE_DAMA_LONGY.Julie.pdf is not a valid PDF/A-1b file, error(s) :
1.1 : Header Syntax error, Second line must begin with '%' followed by at least 4 bytes greater than 127
1.0 : Syntax error, XREF for 11:0 points to wrong object: 10:0
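The first message in each report (1.1) concerns the file header. A rough way to inspect that part of a buffer before handing it to a parser is sketched below. The function name `inspectPdfHeader` is hypothetical, the check follows only the wording of the preflight message, and it says nothing about the XREF errors (1.0). The preflight tool tests PDF/A-1b conformance, so a missing binary comment line does not by itself mean a general-purpose parser will reject the file.

```javascript
// Sketch: look at the first two lines of a PDF buffer.
// - startsWithPdf: the file begins with "%PDF-"
// - hasBinaryComment: the second line begins with '%' followed by four bytes
//   greater than 127, as described in preflight message 1.1.
function inspectPdfHeader(buf) {
  const startsWithPdf = buf.slice(0, 5).toString('latin1') === '%PDF-';

  // Find the end of the first line (LF or CR).
  let i = 0;
  while (i < buf.length && buf[i] !== 0x0a && buf[i] !== 0x0d) i++;
  // Skip the line terminator(s).
  while (i < buf.length && (buf[i] === 0x0a || buf[i] === 0x0d)) i++;

  let hasBinaryComment = false;
  if (buf[i] === 0x25) { // '%'
    let high = 0;
    for (let j = i + 1; j <= i + 4 && j < buf.length; j++) {
      if (buf[j] > 127) high++;
    }
    hasBinaryComment = high === 4;
  }
  return { startsWithPdf, hasBinaryComment };
}
```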

In case the PDFs contain personal data, feel free to delete them again.

rernens commented 4 years ago

@rkusa thanks for this fast response.

Before I test the code, could you check the attached PDF, which comes directly from Adobe and fails in the same way, to see whether it is the same problem or a different one?

Bienvenue.pdf

rkusa commented 4 years ago

@rernens This PDF fails with a different error. I've added it to my TODO list to look into. Regarding the syntax errors in the other PDFs, the following issue has the same problem: https://github.com/rkusa/pdfjs/issues/211#issuecomment-632661935

If it ends up coming up very often, I might change my mind and look into it myself. However, until now, my deployment of pdfjs never encountered that syntax error before - but merging PDFs is also not my main use-case.

rernens commented 4 years ago

@rkusa

Version 2.3.8 fixes the issue with some documents but not all, as you already know. Thanks for your help so far. Fixing it globally would be a must for us.

rernens commented 4 years ago

@rkusa

Hi Markus, 2.3.8 fixes some issues. As a workaround for now, we save the PDFs that produce an error when parsed by the Lexer as separate documents.

But we are experiencing another case where a PDF passes the Lexer stage without error and gets merged, but the page is empty.

The first attached PDF is the result of merging two PDFs: a summary PDF and the PDF uploaded by the customer. As you can see, the second page is white.

The second attached PDF is the one that was added as the second page of the merged PDF above. MOREAU_CINDY_000000022_201912_20200528161813.pdf 000001-actualisation decembre 2019.pdf

Thanks for your help.

rernens commented 4 years ago

@rkusa

Hi Markus. The white pages are produced by protected PDFs! No parsing error is raised. Using qpdf to remove the protection from the PDF before merging it fixes the problem. But I am still facing numerous parsing errors that prevent pdfjs from loading an ExternalDocument, even though Acrobat opens the same files without reporting any parsing error. I hope you can do something about that.
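The qpdf step can be sketched as follows. Both function names are hypothetical, the encryption check is a plain byte search and therefore only a heuristic, and the code assumes that the `qpdf` binary is on the PATH and that the file opens without a user password:

```javascript
const { execFileSync } = require('child_process');

// Heuristic only: protected PDFs reference an /Encrypt dictionary from their
// trailer. A raw byte search can give false positives if the same bytes
// happen to occur elsewhere in the file.
function looksEncrypted(buf) {
  return buf.includes('/Encrypt');
}

// Assumption: qpdf is installed and on PATH. `qpdf --decrypt in.pdf out.pdf`
// writes an unprotected copy; files that need a password to open would
// require additional qpdf options.
function decryptWithQpdf(inputPath, outputPath) {
  execFileSync('qpdf', ['--decrypt', inputPath, outputPath]);
}
```

A possible flow is to run `looksEncrypted` on each buffer and pass only the flagged files through `decryptWithQpdf` before merging, so that unprotected files skip the extra step.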

rkusa commented 4 years ago

I'll try to look into the additional errors once I find the time for it. Since merging existing PDFs was never intended to be the main use-case of pdfjs, please don't expect it to be as permissive to different PDF features and syntax variants as e.g. Acrobat Reader. If your main use-case is to merge PDFs, I have to honestly say that pdfjs might not be the best tool for the job 😕

rernens commented 4 years ago

@rkusa Hi Markus. Even if this was not the main use-case of pdfjs, so far it has proven to be the most lightweight and most reliable option for merging PDFs. I tried many libraries and yours remains unmatched so far, even if some parsing issues remain. Thanks for that.

otroboe commented 4 years ago

Hello!

I don't know if it would help, but I needed to merge PDF too, and this library saved me (so thanks a lot @rkusa, very nice job!).

Some of my PDF files are generated with Puppeteer and others are stored in AWS. I work with Buffers only, and it seems that's what you need too, @rernens.

Here's my method:

const {Document, ExternalDocument} = require('pdfjs');

/**
 * Merge multiple PDF buffers into one buffer
 *
 * @param {Array} bufferList
 * @return {Promise}
 */
const mergeBufferPdfs = (bufferList) => {
    if (bufferList.length === 0) {
        throw new Error('You must pass buffers to merge a PDF');
    }

    const mergeDocument = new Document();
    let externalDocument;

    bufferList.forEach((buffer) => {
        externalDocument = new ExternalDocument(buffer);
        mergeDocument.addPagesOf(externalDocument);
    });

    return mergeDocument.asBuffer();
};

So far so good, I like when it's simple. I hope it can help somehow.

rkusa / pdfjs

Merging some pdfs results in ExternalDocument returning error #214