hyzyla / pdfium

Typescript wrapper for the PDFium library, works in browser and node.js
https://pdfium.js.org
MIT License

Any way to extract all images #12

Open nitin2953 opened 2 months ago

nitin2953 commented 2 months ago

Some popular libraries don't support exporting images from PDFs, and other methods don't work well. Is there any way to extract all images (or just JPG, PNG, BMP) with pdfium?

hyzyla commented 2 months ago

Currently pdfium doesn’t have any methods to do that, but it doesn't seem hard to implement: https://stackoverflow.com/questions/72224050/get-pdf-images-in-an-array-using-pdfium-to-edit-them

nitin2953 commented 2 months ago

That would be incredibly useful for editing, replacing, compressing, and sharing images, among other tasks.

hyzyla commented 2 months ago

Would you like to contribute to the project? I think it could be an iterator that returns objects that are wrappers around the raw FPDF_Object, something like what we have for iterating pages in PDFDocument.

nitin2953 commented 2 months ago

Sorry, but I'm just a frontend web developer. I know nothing about C/C++.

hyzyla commented 2 months ago

No worries, I'll look into this issue later next week.

hyzyla commented 2 months ago

Check this out https://pdfium.js.org/docs/extract-images-from-page

nitin2953 commented 2 months ago

Thank you again, but this is what works for me:

- await fs.writeFile(`output/${index}.png`, image.data);
+ await fs.writeFile(`output/${index}.png`, Buffer.from(image.buffer));

A small question: it looks like this currently returns only the rendered image, not the original image data. If the embedded image is a PNG, the resulting PNG is similar in size, but if it is a JPEG, the resulting PNG is much larger. Not sure, but maybe _FPDFImageObj_GetImageDataRaw could be used to get the original image, and the bitmap method could be used for other image types.

hyzyla commented 2 months ago

Could you send an example PDF with images?

nitin2953 commented 2 months ago

Here it is: image-pdf.zip

hyzyla commented 2 months ago

You may check version 1.0.11 with the new method PDFiumImageObject.getRawImageData.


const document = await library.loadDocument(buff);

const page = document.getPage(0);
const object = page.getObject(0);

const {
  data,
  width,
  height,
  filters,
} = object.getImageDataRaw();
/* 
Example output:
  {
    data: [...], // Raw uncompressed image data
    width: 100, // Image width
    height: 100, // Image height
    filters: ["DCTDecode"], // Filters/decoders used to decode the image data
  } 
*/

However, when I implemented that method, I realized that it might not be what you want. Images in PDF files are not stored as JPEG or PNG; they are stored as bitmap values compressed with some filters/decoders. When you get the raw data, you need to decompress it and then convert that bitmap into a JPEG image. You can achieve the same result using the render method and a custom render function. See this section in the documentation for more details:

import sharp from 'sharp';
import { PDFiumPageRenderOptions } from '@hyzyla/pdfium';

const document = await library.loadDocument(buff);
const page = document.getPage(0);
const object = page.getObject(0);

const image = await object.render({
  render: async (options: PDFiumPageRenderOptions): Promise<Buffer> => {
    return await sharp(options.data, {
      raw: {
        width: options.width,
        height: options.height,
        channels: 4,
      },
    })
      .jpeg({ quality: 80 }) // Use JPEG format with 80% quality
      .toBuffer();
  },
});
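Whether the raw bytes need re-encoding depends on the filter chain. As a hedged sketch: when `filters` is exactly `["DCTDecode"]`, the PDF specification defines the stream contents as JPEG-encoded data, so the bytes can be checked for the JPEG start marker before deciding to re-encode. `guessRawImageExtension` is a hypothetical helper, not part of @hyzyla/pdfium, and it assumes `data` is a `Uint8Array` or `Buffer`; the thread does not confirm that this works for every PDF.

```javascript
// Hypothetical helper, not part of @hyzyla/pdfium.
// Assumes `filters` is an array of PDF filter names and `data` is a
// Uint8Array (or Buffer) holding the still-encoded stream bytes.
function guessRawImageExtension(filters, data) {
  // JPEG data begins with the SOI marker FF D8, followed by another FF byte.
  const hasJpegSignature =
    data.length >= 3 && data[0] === 0xff && data[1] === 0xd8 && data[2] === 0xff;

  if (filters.length === 1 && filters[0] === 'DCTDecode' && hasJpegSignature) {
    return 'jpg'; // the bytes can likely be written to disk unchanged
  }
  return null; // any other filter chain: fall back to rendering the object
}

// Synthetic bytes for illustration only, not a real image.
const sample = Uint8Array.from([0xff, 0xd8, 0xff, 0xe0]);
console.log(guessRawImageExtension(['DCTDecode'], sample));
console.log(guessRawImageExtension(['FlateDecode'], sample));
```

If the helper returns `null`, the render approach shown above remains the safer path, since other filters (for example `FlateDecode`) yield pixel data rather than a standalone image file.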

nitin2953 commented 2 months ago

Thank you again for the solution & explanation

By the way, any update on this? The docs say image.data will work, but I have to use Buffer.from(image.buffer):

- await fs.writeFile(`output/${index}.png`, image.data);
+ await fs.writeFile(`output/${index}.png`, Buffer.from(image.buffer));

hyzyla commented 2 months ago

@nitin2953 I’ve checked all the methods that return an image, and all of them return data: Buffer, so it should work without Buffer.from. Could you provide a more complete example?

nitin2953 commented 2 months ago

I'm using this in JavaScript:

const fs = require('fs').promises;
const { PDFiumLibrary } = require('@hyzyla/pdfium');

async function main() {
  const buff = await fs.readFile('test_3_with_images.pdf')
  const library = await PDFiumLibrary.init();
  const document = await library.loadDocument(buff);

  let index = 0;

  for (const page of document.pages()) {
    for (const object of page.objects()) {
      if (object.type === "image") {

        const { data: image } = await object.render({ render: "sharp" });

        // await fs.writeFile(`exported-images/${index}.png`, image.data); // NOT WORKING
        await fs.writeFile(`exported-images/${index}.png`, Buffer.from(image.buffer)); // WORKING
        index++;
      }
    }
  }

  document.destroy();
  library.destroy();
}
main();

I get this error when using image.data:

node:internal/fs/promises:1211
    validateStringAfterArrayBufferView(data, 'data');
    ^

TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received an instance of ArrayBuffer
    at Object.writeFile (node:internal/fs/promises:1211:5)
    at main (C:\Users\nitin\Desktop\PDF\export-images.js:18:14) {
  code: 'ERR_INVALID_ARG_TYPE'
}

Node.js v22.1.0
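One possible reading of the script above, offered as a hypothesis rather than a confirmed diagnosis: `const { data: image } = ...` binds the `data` property to a local variable named `image`, so `image` would already be the bytes and `image.data` would not exist. The sketch below uses a stand-in object instead of the library (the `{ data: Buffer }` shape is an assumption taken from the maintainer's remark), and `toBuffer` is a hypothetical helper. It does not fully explain the error text, which reports an `ArrayBuffer`.

```javascript
// Stand-in for the value returned by object.render(...); the real shape is
// assumed from the "data: Buffer" remark in this thread.
const rendered = { data: Buffer.from([0x89, 0x50, 0x4e, 0x47]), width: 1, height: 1 };

// `{ data: image }` renames the `data` property to a local called `image`.
const { data: image } = rendered;

const imageIsBuffer = Buffer.isBuffer(image); // `image` is the bytes themselves
const nestedData = image.data; // Buffers have no `data` property
const backingStore = image.buffer; // the underlying ArrayBuffer

// Hypothetical helper: wrap exactly the bytes of a typed-array view.
// Buffer.from(view.buffer) alone would expose the whole backing ArrayBuffer,
// which can be larger than the view.
function toBuffer(view) {
  return Buffer.from(view.buffer, view.byteOffset, view.byteLength);
}

// A view over the middle of a larger ArrayBuffer shows the difference.
const slice = new Uint8Array([1, 2, 3, 4, 5]).subarray(1, 3);

console.log({
  imageIsBuffer,
  nestedDataType: typeof nestedData,
  backingIsArrayBuffer: backingStore instanceof ArrayBuffer,
  wholeBackingLength: Buffer.from(slice.buffer).length,
  exactLength: toBuffer(slice).length,
});
```

Under this reading, `fs.writeFile(path, image)` would be enough in the script above, and `toBuffer(image)` is the safer conversion when a `Buffer` is explicitly required.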