rkusa / pdfjs

A Portable Document Format (PDF) generation library targeting both the server- and client-side.
MIT License

Combined PDF size is bigger than the originals #314

Open wildhart opened 1 year ago

wildhart commented 1 year ago

I'm using pdfjs to combine two PDF files. Each file is about 3 MB, but the combined PDF is 40 MB.

One of the files is a scanned document where each page is actually an image. Are the images getting upscaled as they are added to the combined PDF? If so, is there a way I can prevent this?

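    // Merge every source PDF into a single output document.
    const outputDoc = new pdfjs.Document();
    for (const url of urls) {
        const buffer = await downloadUrlAsBuffer(url); // my own code
        const doc = new pdfjs.ExternalDocument(buffer);
        outputDoc.addPagesOf(doc); // append all pages of this source
    }
    // asBuffer() is asynchronous, so wait for the finished PDF.
    const buffer = await outputDoc.asBuffer();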


In a perfect world I would be able to specify a maximum DPI: any image above that resolution would be downscaled to it, and any image at or below it would be left unmodified.
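To make the request concrete, here is a rough sketch of the rule I have in mind. It is not an existing pdfjs feature; `effectiveDpi`, `capResolution` and the shape of the `image` object are hypothetical names for illustration. It assumes the default PDF scale of 72 points per inch and only computes the target pixel size, leaving the actual resampling to whatever image library is at hand.

```javascript
const POINTS_PER_INCH = 72; // default PDF user space: 72 points per inch

// Effective resolution of an image along one axis:
// pixel count divided by the placed size in inches.
function effectiveDpi(pixels, placedPoints) {
  return pixels / (placedPoints / POINTS_PER_INCH);
}

// Decide the pixel size an image should have under a DPI cap.
// Images at or below the cap are returned unchanged (never upscaled).
function capResolution(image, maxDpi) {
  const dpi = Math.max(
    effectiveDpi(image.widthPx, image.widthPt),
    effectiveDpi(image.heightPx, image.heightPt)
  );
  if (dpi <= maxDpi) {
    return { widthPx: image.widthPx, heightPx: image.heightPx, resampled: false };
  }
  const scale = maxDpi / dpi; // always < 1 here
  return {
    widthPx: Math.round(image.widthPx * scale),
    heightPx: Math.round(image.heightPx * scale),
    resampled: true,
  };
}

// A US Letter page (612 x 792 pt) scanned at 600 DPI is 5100 x 6600 px.
const scan = { widthPx: 5100, heightPx: 6600, widthPt: 612, heightPt: 792 };
console.log(capResolution(scan, 150));
// { widthPx: 1275, heightPx: 1650, resampled: true }
```

With a 150 DPI cap, a scan that is already 1275 x 1650 px on the same page size would come back untouched with `resampled: false`.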

I'll send the two files to you by email...

wildhart commented 9 months ago

@rkusa Have you been able to look into this yet?

I've done some more experiments with different files. The source file is one I generated with another library, jsPDF, and it contains text, lines and images.

The full file contains 5 pages with a header image on each page plus two pages with photos.

Here's a summary of my findings:

| Description | Original | Original size | Converted | Converted size |
|---|---|---|---|---|
| Full | orig.pdf | 2.15 MB | converted.pdf | 10.7 MB |
| No headers | orig no headers.pdf | 2.09 MB | converted no headers.pdf | 10.4 MB |
| No images | orig no images.pdf | 71.8 kB | converted no images.pdf | 341 kB |
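Dividing each converted size by its original size makes the pattern easier to see. The snippet below is just that arithmetic on the figures from the table, taking 1 MB as 1000 kB; the variable and function names are my own.

```javascript
// Sizes copied from the table, expressed in kB (1 MB taken as 1000 kB).
const measurements = [
  { description: "Full", originalKb: 2150, convertedKb: 10700 },
  { description: "No headers", originalKb: 2090, convertedKb: 10400 },
  { description: "No images", originalKb: 71.8, convertedKb: 341 },
];

// Growth factor = converted size / original size.
function growthFactor({ originalKb, convertedKb }) {
  return convertedKb / originalKb;
}

const factors = measurements.map((m) => ({
  description: m.description,
  factor: growthFactor(m),
}));

for (const { description, factor } of factors) {
  console.log(`${description}: ${factor.toFixed(2)}x`);
}
// Full: 4.98x
// No headers: 4.98x
// No images: 4.75x
```

The output grows by roughly a factor of five in all three cases, including the file with no images at all. That suggests, though it does not prove, that the growth is not specific to how images are handled.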

Why is pdfjs increasing the output file size so much?

wildhart commented 9 months ago

Just FYI, I've moved away from using pdfjs for merging PDFs because of this issue with excessive file sizes, and also because of #312, where certain PDF files throw errors when they are merged.

Instead I'm using pdf-lib, which makes it really easy to copy pages from one PDF to another. It doesn't have any problems with the files we provided in #312 that throw errors in pdfjs, and the output file size is never bigger than the original files. It also seems a bit faster.

I'm still using pdfjs to generate PDFs from HTML, but I then use pdf-lib to combine the result with other PDF files.
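For anyone who wants to try the same approach, copying pages with pdf-lib looks roughly like this. It is a minimal sketch rather than my exact code: `mergePdfs` is a name I made up, and the pdf-lib `PDFDocument` class is passed in as a parameter so the snippet doesn't assume how the library is loaded. `sources` is assumed to be an array of PDF buffers that have already been downloaded.

```javascript
// Merge several PDFs by copying every page into a new document.
// `PDFDocument` is the class exported by pdf-lib.
async function mergePdfs(PDFDocument, sources) {
  const merged = await PDFDocument.create();
  for (const bytes of sources) {
    const source = await PDFDocument.load(bytes);
    // copyPages returns copies that belong to `merged` but are not yet added.
    const pages = await merged.copyPages(source, source.getPageIndices());
    for (const page of pages) {
      merged.addPage(page);
    }
  }
  return merged.save(); // resolves to the bytes of the merged PDF
}

// Usage (assumes pdf-lib is installed):
//   const { PDFDocument } = require("pdf-lib");
//   const mergedBytes = await mergePdfs(PDFDocument, [bufferA, bufferB]);
```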