Cropping vector-based PDF files

msageryd commented 5 months ago

If I'm correctly informed, PDFium seems to be able to crop vector based PDF pages and output a smaller PDF.

I need to embed multiple smaller parts of a PDF page in another PDF. I cannot use CSS clipping or viewBox, since this only adjusts the visible part. Behind the scenes, the complete PDF page will be embedded multiple times and blow up the size of my generated PDF.

Would it be possible to expose this cropping functionality in pdfium-cli?

Apparently, this should be possible with pdf-lib (https://pdf-lib.js.org/), but I'd prefer the speed of PDFium for this.

jerbob92 commented 5 months ago

PDFium seems to be able to crop vector based PDF pages and output a smaller PDF.

If you can indicate how you think that should work I could look into it.

go-pdfium, the library behind pdfium-cli, contains all functionality that pdfium exposes (with some exceptions), so if it's possible with PDFium it should not be that much work.

msageryd commented 5 months ago

Here is a vector based PDF and a vector based crop from this PDF. The crop was made with the pdf-lib Node library as per the following (code from ChatGPT):

const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

async function cropPdf(inputPath, outputPath, cropArea) {
  // Step 1: Load the existing PDF
  const existingPdfBytes = fs.readFileSync(inputPath);
  const pdfDoc = await PDFDocument.load(existingPdfBytes);

  // Step 2: Get the first page to crop (adjust if you need a different page)
  const [existingPage] = pdfDoc.getPages();

  // Step 3: Create a new PDF document
  const newPdfDoc = await PDFDocument.create();

  // Step 4: Define the crop area (adjust as needed)
  const { x, y, width, height } = cropArea;

  // Adjust the y-coordinate to account for PDF coordinate system
  const adjustedY = existingPage.getHeight() - y - height;

  // Step 5: Create a new page in the new PDF with the size of the crop area
  const newPage = newPdfDoc.addPage([width, height]);

  // Step 6: Embed the page from the existing PDF into the new PDF
  const embeddedPages = await newPdfDoc.embedPdf(existingPdfBytes, [0]);
  const embeddedPage = embeddedPages[0];

  // Step 7: Draw the cropped area of the embedded page onto the new page
  newPage.drawPage(embeddedPage, {
    x: -x,
    y: -adjustedY,
    width: existingPage.getWidth(),
    height: existingPage.getHeight(),
  });

  // Step 8: Save the cropped PDF
  const croppedPdfBytes = await newPdfDoc.save();
  fs.writeFileSync(outputPath, croppedPdfBytes);
}

// Define the crop area (example values, adjust as needed)
const cropArea = {
  x: 100, // X-coordinate of the crop area origin
  y: 100, // Y-coordinate of the crop area origin
  width: 400, // Width of the crop area
  height: 400, // Height of the crop area
};

// Crop the PDF and save the result
cropPdf('./blueprint_full.pdf', './blueprint_crop.pdf', cropArea)
  .then(() => console.log('PDF cropped successfully'))
  .catch((err) => console.error('Error cropping PDF:', err));

blueprint_crop.pdf blueprint_full.pdf
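The coordinate adjustment in the snippet above is the part that is easiest to get wrong, so here it is isolated as a small sketch. `cropToDrawOptions` is a hypothetical helper, not part of pdf-lib; it assumes the crop area is given with a top-left origin (as in the example values) and returns the options that would be passed to `drawPage`, which uses PDF's bottom-left origin. The A4 page size in the example is an assumption for illustration.

```javascript
// Hypothetical helper (not a pdf-lib API): converts a crop rectangle given
// with a top-left origin into drawPage options in PDF's bottom-left system.
function cropToDrawOptions(pageWidth, pageHeight, cropArea) {
  const { x, y, width, height } = cropArea;
  if (x < 0 || y < 0 || x + width > pageWidth || y + height > pageHeight) {
    throw new RangeError('Crop area must lie inside the page');
  }
  // Distance from the bottom edge of the page to the bottom edge of the crop.
  const bottom = pageHeight - y - height;
  return {
    // Shift the full page so the crop's lower-left corner lands at (0, 0).
    x: -x,
    y: -bottom,
    // The page is drawn at full size; the new page's bounds hide the rest.
    width: pageWidth,
    height: pageHeight,
  };
}

// Example: an A4 page (595 x 842 pt, assumed) and the crop area used above.
const drawOptions = cropToDrawOptions(595, 842, { x: 100, y: 100, width: 400, height: 400 });
console.log(drawOptions); // { x: -100, y: -342, width: 595, height: 842 }
```

Note that this only moves the page so the wanted region falls inside the new, smaller page. Nothing is removed from the embedded content.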

jerbob92 commented 5 months ago

But that's pdf-lib, not pdfium.

msageryd commented 5 months ago

Yes, I attached the code as inspiration =) pdf-lib is all JavaScript. I'd rather have this functionality natively via PDFium.

ChatGPT actually gave me the code for using PDFium from Node, but this is a bit too deep for me. I haven't tried the code.


#include <nan.h>
#include "public/fpdfview.h"

void CropPDF(const Nan::FunctionCallbackInfo<v8::Value>& info) {
  if (info.Length() < 3) {
    Nan::ThrowTypeError("Wrong number of arguments");
    return;
  }

  v8::String::Utf8Value inputPath(info[0]->ToString());
  v8::String::Utf8Value outputPath(info[1]->ToString());
  v8::Local<v8::Object> cropArea = info[2]->ToObject();

  double x = cropArea->Get(Nan::New("x").ToLocalChecked())->NumberValue();
  double y = cropArea->Get(Nan::New("y").ToLocalChecked())->NumberValue();
  double width = cropArea->Get(Nan::New("width").ToLocalChecked())->NumberValue();
  double height = cropArea->Get(Nan::New("height").ToLocalChecked())->NumberValue();

  FPDF_InitLibrary();

  FPDF_DOCUMENT doc = FPDF_LoadDocument(*inputPath, nullptr);
  if (!doc) {
    Nan::ThrowError("Failed to load PDF document");
    return;
  }

  FPDF_PAGE page = FPDF_LoadPage(doc, 0);
  if (!page) {
    FPDF_CloseDocument(doc);
    Nan::ThrowError("Failed to load PDF page");
    return;
  }

  double pageWidth = FPDF_GetPageWidth(page);
  double pageHeight = FPDF_GetPageHeight(page);

  FPDF_PAGEOBJECT crop = FPDFPageObj_NewRect(x, pageHeight - y - height, width, height);
  FPDFPageObj_SetFillColor(crop, 255, 255, 255, 255);
  FPDFPage_InsertObject(page, crop);

  FPDF_FFLDrawPageBitmap(page, bitmap, 0, 0, bitmapWidth, bitmapHeight, 0, 0);

  FPDF_SaveAsCopy(doc, outputPath, FPDF_REMOVE_SECURITY);

  FPDF_ClosePage(page);
  FPDF_CloseDocument(doc);
  FPDF_DestroyLibrary();
}

NAN_MODULE_INIT(Init) {
  NAN_EXPORT(target, CropPDF);
}

NODE_MODULE(pdf_cropper, Init)

jerbob92 commented 5 months ago

Yeah but you said

If I'm correctly informed, PDFium seems to be able to crop vector based PDF pages and output a smaller PDF.

So I was under the assumption you had some idea of how this should work in PDFium, not pdf-lib.

The PDFium code you just posted does not crop out a part of a page; it renders a part of a page and then puts that bitmap inside a new PDF. If your PDFs are vectors, rendering them to an image might make the PDF bigger than just using viewBox cropping.

msageryd commented 5 months ago

I'm sorry if I misunderstood and took Chat's word as true. I got perfect help with pdf-lib, so I thought Chat knew what it was talking about. Here is what Chat said about this:

PDFium is a powerful library for working with PDFs, and it can indeed be used to manipulate PDFs while maintaining their vector properties. However, to crop a PDF page using PDFium and save it as a new, smaller PDF, we need to use its API to perform the necessary operations. This will involve creating a C++ application or a Node.js native addon using PDFium.

jerbob92 commented 5 months ago

ChatGPT is a master in mixing the truth with lies :cry:

As far as I know, PDFium does not have this capability. The way you can crop pages is by:

Using viewbox
Rendering a part of the page and inserting that into a new PDF

You could probably write some code to get a vector object from a page, then manipulate the vector to be cropped, and insert it in a new PDF, but it won't be simple.

msageryd commented 5 months ago

That's true, but I find GPT-4o to be quite a bit more reliable. I gave Chat a new chance and got new code examples where FPDF_RenderPageBitmapWithMatrix is not used. Instead it looks like this. Do you recognize these commands or is it another hallucination?

// Create a page object from the input page
    FPDF_CLIPPATH clipPath = FPDF_CreateClipPath(x, adjustedY, width, height);
    FPDFPage_SetClipPath(inputPage, clipPath);
    FPDF_CopyPage(outputDoc, outputDoc, 0, 0);

    // Save the new document
    FPDF_SaveAsCopy(outputDoc, outputPath, 0);

jerbob92 commented 5 months ago

I think a clip path is very similar to using a viewbox.

 * Clip the page content, the page content that outside the clipping region
 * become invisible.

msageryd commented 5 months ago

Seems likely, but I'm very interested in getting rid of the invisible parts completely, i.e. getting a smaller cropped PDF, as I can do with pdf-lib.

I'm generating PDF reports with embedded cropped blueprints. When I embed PDFs with viewBox, the original PDF is still embedded in my new PDF. Since I'm embedding many (sometimes hundreds) of crops, my output PDF blows up to an unreasonable size, i.e. hundreds of full-size PDF files are embedded in my PDF report.

msageryd commented 5 months ago

Hm, I might have misunderstood. Do you mean that SetClipPath in PDFium does not actually get rid of the invisible parts, but only hides them?

jerbob92 commented 5 months ago

Seems likely, but I'm very interested in getting rid of the invisible parts completely, i.e. getting a smaller cropped PDF, as I can do with pdf-lib.

Are you sure that's what pdf-lib does? Because your cropped PDF is bigger in size than the original and when inspected with iText RUPS, it just embeds the original page XObject and then clips it, so not very different from a clip path or viewbox...

Hm, I might have misunderstood. Do you mean that SetClipPath in PDFium does not actually get rid of the invisible parts, but only hides them?

Correct, just like the viewbox

msageryd commented 5 months ago

I just noticed that. It seems this can be mitigated afterwards by optimizing with Ghostscript. GS seems to be able to get rid of the invisible parts.

blueprint_full.pdf = 161 KB
blueprint_crop.pdf = 306 KB
optimized crop = 22 KB

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=cropped_optimized.pdf blueprint_crop.pdf
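Since the crop step already runs in Node, the Ghostscript call can be wrapped the same way. This is only a sketch: `ghostscriptArgs` and `optimizePdf` are hypothetical names, the flags are copied from the command above, and it assumes a `gs` binary is available on PATH. Keep in mind that `-dPDFSETTINGS=/screen` also downsamples embedded raster images, so part of any size reduction may come from that rather than from dropping hidden content.

```javascript
const { execFile } = require('node:child_process');

// Builds the argument list for the gs invocation shown above.
// Hypothetical helper; the flags are taken verbatim from that command.
function ghostscriptArgs(inputPath, outputPath) {
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.4',
    '-dPDFSETTINGS=/screen',
    '-dNOPAUSE',
    '-dQUIET',
    '-dBATCH',
    `-sOutputFile=${outputPath}`,
    inputPath,
  ];
}

// Runs Ghostscript and resolves with the output path.
// Assumes `gs` is on PATH; rejects if the process fails to start or exits non-zero.
function optimizePdf(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    execFile('gs', ghostscriptArgs(inputPath, outputPath), (err) => {
      if (err) reject(err);
      else resolve(outputPath);
    });
  });
}
```

Usage would look like `optimizePdf('blueprint_crop.pdf', 'cropped_optimized.pdf').then(console.log)`. Passing the arguments as an array to `execFile` avoids shell quoting problems with file names.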

msageryd commented 5 months ago

Here are some potential GPT hallucinations. I asked if PDFium was able to do this kind of optimization.

Is this a real or imaginary command?

// Optimize the new document to remove unused objects
    FPDFPage_RemoveClipRect(outputPage);

jerbob92 commented 5 months ago

I just noticed that. It seems this can be mitigated afterwards by optimizing with Ghostscript. GS seems to be able to get rid of the invisible parts.

Then the same will probably work when using clip paths in PDFium. Be aware that GS is AGPL licensed.

Here are some potential GPT hallucinations. I asked if PDFium was able to do this kind of optimization. Is this a real or imaginary command?

There is no FPDFPage_RemoveClipRect indeed.

msageryd commented 5 months ago

I found some other optimization tools under Apache 2.0 licenses. I will look into them. But I'd still need to prepare the PDF with the cliprect. This might not be a resource demanding operation, so the move to PDFium just for this might be moot.

On my M2 Max it takes about 100ms to "crop" a vector pdf with pdf-lib. Then another 50ms to optimize with GhostScript. I'll try the other tools and see what they can give. One of them is written in Go.

jerbob92 commented 5 months ago

Actual cropping is quite complex. I suspect it works with GS because it always completely rebuilds the PDF; I think that's how GS works.

msageryd commented 5 months ago

Thank you for your input. Great as usual. I think I have stumbled on a much better solution, which I'll try to implement. I'll close this issue now.

The new solution: It should be possible to embed an image or PDF once in a PDF and then reference the same file multiple times with different viewBoxes. Usually I'm only dealing with 1-5 original blueprints, but I need to present hundreds of different crops from these files.
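A rough sketch of that idea with pdf-lib, untested against real blueprints: the source page is embedded once with `embedPdf`, and the same embedded page is then drawn on many small pages, each shifted so a different region falls inside the page bounds. `cropOffsets` and `buildCropReport` are hypothetical names, and crop areas are assumed to use a top-left origin as in the earlier example. As far as I understand pdf-lib, every `drawPage` call references the same embedded object, so the blueprint should be stored only once, but that is worth verifying with a tool like iText RUPS.

```javascript
// Hypothetical helper: computes the page size and drawPage offset per crop.
// Crop areas use a top-left origin; PDF coordinates start at the bottom-left.
function cropOffsets(pageHeight, crops) {
  return crops.map(({ x, y, width, height }) => ({
    pageSize: [width, height],
    draw: { x: -x, y: -(pageHeight - y - height) },
  }));
}

// Builds a report with one page per crop, all drawn from a single embedded page.
async function buildCropReport(sourceBytes, crops) {
  // Required lazily so cropOffsets can be used without pdf-lib installed.
  const { PDFDocument } = require('pdf-lib');
  const report = await PDFDocument.create();

  // Embed the first page of the source document once.
  const [embedded] = await report.embedPdf(sourceBytes, [0]);

  for (const { pageSize, draw } of cropOffsets(embedded.height, crops)) {
    const page = report.addPage(pageSize);
    // Draw the full page shifted; the small page's bounds hide the rest.
    page.drawPage(embedded, {
      ...draw,
      width: embedded.width,
      height: embedded.height,
    });
  }
  return report.save();
}
```

Usage would be along the lines of `buildCropReport(fs.readFileSync('./blueprint_full.pdf'), [cropArea, otherCropArea])`, writing the returned bytes to disk. The hidden parts of the blueprint are still in the file, but only one copy per source blueprint rather than one per crop.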

klippa-app / pdfium-cli

Cropping vector-based PDF files #40