Open nitin2953 opened 2 months ago
Currently pdfium doesn't have any methods to do that, but it doesn't seem hard to implement: https://stackoverflow.com/questions/72224050/get-pdf-images-in-an-array-using-pdfium-to-edit-them
That would be incredibly useful for editing, replacing, compression, sharing, and other tasks
Would you like to contribute to the project? I think it could be an iterator that returns objects that are wrappers around the raw FPDF_Object, something like what we have for iterating pages in PDFDocument.
Sorry but I'm just a frontend web developer, I know nothing about C/C++
No worries, I'll look into that issue later next week
Check this out https://pdfium.js.org/docs/extract-images-from-page
Thank you again, but this is what works for me:
- await fs.writeFile(`output/${index}.png`, image.data);
+ await fs.writeFile(`output/${index}.png`, Buffer.from(image.buffer));
A small question: it looks like this currently returns only the rendered image, not the original image data.
If the embedded image is a PNG, the resulting PNG is similar in size, but if it is a JPEG, the resulting PNG is much larger.
I don't know, maybe _FPDFImageObj_GetImageDataRaw can be used to get the original image, and the Bitmap method can be used to get other types of images.
Could you send an example PDF with images?
Here it is: image-pdf.zip
You may check version 1.0.11 with the new method PDFiumImageObject.getRawImageData.
const document = await library.loadDocument(buff);
const page = document.getPage(0);
const object = page.getObject(0);
const {
data,
width,
height,
filters,
} = object.getImageDataRaw();
/*
Example output:
{
data: [...], // Raw uncompressed image data
width: 100, // Image width
height: 100, // Image height
filters: ["DCTDecode"], // Filters/decoders used to decode the image data
}
*/
However, when I implemented that method, I realized that it might not be what you want. Images in PDF files are not stored as JPEG or PNG; they are stored as bitmap values compressed with some filters/decoders. When you get the raw data, you need to decompress it and then convert that bitmap into a JPEG image. You can achieve the same result using the render method and a custom render function. See this section in documentation for more details:
import sharp from 'sharp';
import { PDFiumPageRenderOptions } from '@hyzyla/pdfium';
const document = await library.loadDocument(buff);
const page = document.getPage(0);
const object = page.getObject(0);
const image = await object.render({
render: async (options: PDFiumPageRenderOptions): Promise<Buffer> => {
return await sharp(options.data, {
raw: {
width: options.width,
height: options.height,
channels: 4,
},
})
.jpeg({ quality: 80 }) // Use JPEG format with 80% quality
.toBuffer();
},
});
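For the raw-data route described above, here is a minimal sketch of how the `filters` value might be used. When the only filter is `DCTDecode`, the bytes stored in the PDF are already JPEG data, so in principle they can be written to a `.jpg` file without re-encoding. `extensionForFilters` is a hypothetical helper, not part of `@hyzyla/pdfium`, and the sketch assumes `getImageDataRaw()` returns the stream bytes with no filters applied.

```javascript
// Hypothetical helper (not a library API): maps the `filters` array to a file
// extension when the raw bytes are expected to form a usable image file as-is.
function extensionForFilters(filters) {
  // With a filter chain, the outer filters would have to be undone first,
  // so only the single-filter case is handled here.
  if (!Array.isArray(filters) || filters.length !== 1) return null;

  switch (filters[0]) {
    case 'DCTDecode':
      return 'jpg'; // stream contents are JPEG data
    case 'JPXDecode':
      return 'jp2'; // stream contents are JPEG 2000 data
    default:
      return null; // e.g. FlateDecode: compressed pixels, not an image file
  }
}

console.log(extensionForFilters(['DCTDecode'])); // jpg
console.log(extensionForFilters(['FlateDecode'])); // null
```

When the helper returns `null`, rendering the object with a custom render function, as in the example above, is the simpler fallback.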
Thank you again for the solution & explanation
BTW, any update on this? The docs say image.data will work, but I have to use Buffer.from(image.buffer):
- await fs.writeFile(`output/${index}.png`, image.data);
+ await fs.writeFile(`output/${index}.png`, Buffer.from(image.buffer));
@nitin2953 I've checked all the methods that return an image, and all of them return data: Buffer, so it should work without Buffer.from. Could you provide a fuller example?
I'm using this in JavaScript:
const fs = require('fs').promises;
const { PDFiumLibrary } = require('@hyzyla/pdfium');
async function main() {
const buff = await fs.readFile('test_3_with_images.pdf')
const library = await PDFiumLibrary.init();
const document = await library.loadDocument(buff);
let index = 0;
for (const page of document.pages()) {
for (const object of page.objects()) {
if (object.type === "image") {
const { data: image } = await object.render({ render: "sharp" });
// await fs.writeFile(`exported-images/${index}.png`, image.data); // NOT WORKING
await fs.writeFile(`exported-images/${index}.png`, Buffer.from(image.buffer)); // WORKING
index++;
}
}
}
document.destroy();
library.destroy();
}
main();
I am getting this error while using image.data
node:internal/fs/promises:1211
validateStringAfterArrayBufferView(data, 'data');
^
TypeError [ERR_INVALID_ARG_TYPE]: The "data" argument must be of type string or an instance of Buffer, TypedArray, or DataView. Received an instance of ArrayBuffer
at Object.writeFile (node:internal/fs/promises:1211:5)
at main (C:\Users\nitin\Desktop\PDF\export-images.js:18:14) {
code: 'ERR_INVALID_ARG_TYPE'
}
Node.js v22.1.0
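One detail in the snippet as posted is worth checking: `const { data: image } = ...` binds the `data` property itself to the name `image`. The sketch below uses a stand-in object in place of the real `render()` result. Its shape is an assumption based on the statement that `data` is a `Buffer`, and nothing here calls the library. Whether this accounts for the reported error depends on what `render` returned in the installed version, which the thread does not show.

```javascript
// Stand-in for the value returned by object.render(); the shape is assumed.
const rendered = {
  data: Buffer.from([0x89, 0x50, 0x4e, 0x47]),
  width: 1,
  height: 1,
};

// `{ data: image }` renames `data` to `image`, so `image` IS the Buffer.
const { data: image } = rendered;

const imageIsBuffer = Buffer.isBuffer(image); // true
const nestedData = image.data; // undefined: a Buffer has no `data` property
const backingIsBuffer = Buffer.isBuffer(image.buffer); // false: `buffer` is the backing ArrayBuffer

console.log(imageIsBuffer, nestedData, backingIsBuffer); // true undefined false
```

Under that assumption, passing `image` directly to `fs.writeFile` would be enough. `Buffer.from(image.buffer)` wraps the whole backing `ArrayBuffer`, which can be larger than the image bytes when the `Buffer` is a view into a shared pool.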
Some popular libraries don't support exporting images from PDFs, and other methods don't work well. Is there any way to extract all images (or JPG, PNG, BMP) with pdfium?