mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Unable to get the image of an image form element #13704

Closed MorganWaya closed 3 years ago

MorganWaya commented 3 years ago

Attach (recommended) or Link to PDF file here:

Configuration:

I'm trying to list and extract data from each element of a form created in a PDF file. Some image elements are detected as Btn (button) fields, without any indication of the image they contain.

`const items = await pdfPage.getAnnotations();

for (const item of items) {
    // With 'TestImage' as the name of one of my image elements
    if (item.fieldName === 'TestImage') {
        console.log('TestImage', item);
    }
}`

Gives:

TestImage {annotationFlags: 4, borderStyle: {…}, color: Uint8ClampedArray(3), contents: "", hasAppearance: true, …}
    actions: {Action: Array(1)}
    alternativeText: ""
    annotationFlags: 4
    annotationType: 20
    borderStyle:
        dashArray: [3]
        horizontalCornerRadius: 0
        style: 1
        verticalCornerRadius: 0
        width: 0
    checkBox: false
    color: Uint8ClampedArray(3) [0, 0, 0]
    contents: ""
    defaultAppearanceData:
        fontColor: Uint8ClampedArray(3) [0, 0, 0]
        fontName: "HeBo"
        fontSize: 12
    defaultFieldValue: null
    fieldFlags: 65536
    fieldName: "TestImage"
    fieldType: "Btn"
    fieldValue: null
    hasAppearance: true
    hidden: false
    id: "57R"
    isTooltipOnly: false
    modificationDate: null
    pushButton: true
    radioButton: false
    readOnly: false
    rect: (4) [383.563, 623.486, 527.808, 744.176]
    subtype: "Widget"

Any idea?

By contrast, I am able to get images from regular image objects (outside of forms), but without any context (their location and size in the PDF):

// Potential image types
const imagesObjectsTypes = [
    pdfjsLib.OPS.paintImageMaskXObject,
    pdfjsLib.OPS.paintImageMaskXObjectGroup,
    pdfjsLib.OPS.paintImageXObject, // <- 85
    pdfjsLib.OPS.paintInlineImageXObject,
    pdfjsLib.OPS.paintInlineImageXObjectGroup,
    pdfjsLib.OPS.paintImageXObjectRepeat,
    pdfjsLib.OPS.paintImageMaskXObjectRepeat,
];

const operatorsList = await pdfPage.getOperatorList();

for (let i = 0; i < operatorsList.fnArray.length; i++) {
    const type = operatorsList.fnArray[i];

    if (imagesObjectsTypes.indexOf(type) >= 0) {
        const image = pdfPage.objs.get(operatorsList.argsArray[i][0]);

        console.log('Image detected:');
        console.log(image);
        console.log(operatorsList.argsArray[i]);
    }
}

Gives:

Image detected:
{width: 1067, height: 600, kind: 2, data: Uint8ClampedArray(1920600)}
    data: Uint8ClampedArray(1920600) [ …]
    height: 600
    kind: 2
    width: 1067

0: "img_p0_3"
1: 1067
2: 600
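For the missing location and size, one possible approach is to follow the `save`, `restore` and `transform` operators in the same operator list and record the current transformation matrix whenever an image is painted. The sketch below assumes that an image is drawn into the unit square of the current coordinate system, and it ignores form XObjects, annotation transforms and page rotation, so it may give wrong boxes for nested content. `locateImages`, `multiply` and `applyToPoint` are hypothetical helper names, not pdf.js APIs, and the numeric operator codes in the stand-in data are placeholders rather than the real `pdfjsLib.OPS` values.

```javascript
// Multiply two affine matrices given as [a, b, c, d, e, f]:
// the result applies m2 first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map a point [x, y] through an affine matrix.
function applyToPoint(m, [x, y]) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// Hypothetical helper: returns one { id, rect } entry per paintImageXObject,
// where rect is [minX, minY, maxX, maxY] in the coordinate system of the list.
function locateImages(operatorList, OPS) {
  const stack = [];
  let ctm = [1, 0, 0, 1, 0, 0]; // identity
  const results = [];

  operatorList.fnArray.forEach((fn, i) => {
    const args = operatorList.argsArray[i];
    if (fn === OPS.save) {
      stack.push(ctm);
    } else if (fn === OPS.restore) {
      if (stack.length > 0) ctm = stack.pop();
    } else if (fn === OPS.transform) {
      ctm = multiply(ctm, args);
    } else if (fn === OPS.paintImageXObject) {
      // Assumption: the image fills the unit square under the current matrix.
      const corners = [[0, 0], [1, 0], [0, 1], [1, 1]].map((p) => applyToPoint(ctm, p));
      const xs = corners.map((c) => c[0]);
      const ys = corners.map((c) => c[1]);
      results.push({
        id: args[0],
        rect: [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)],
      });
    }
  });
  return results;
}

// Stand-in data so the sketch runs without pdf.js (placeholder codes).
const sampleOPS = { save: 1, restore: 2, transform: 3, paintImageXObject: 4 };
const sampleList = {
  fnArray: [1, 3, 4, 2],
  argsArray: [null, [200, 0, 0, 100, 50, 60], ["img_p0_3", 1067, 600], null],
};
console.log(locateImages(sampleList, sampleOPS));
```

With a real page, the call would presumably be `locateImages(await pdfPage.getOperatorList(), pdfjsLib.OPS)`.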

Thank you :)

Snuffleupagus commented 3 years ago

> Attach (recommended) or Link to PDF file here:

Generally speaking, this part is always required in order for issues to be actionable/valid.

Furthermore, please see https://github.com/mozilla/pdf.js/blob/master/.github/CONTRIBUTING.md (emphasis mine):

> If you are developing a custom solution, first check the examples at https://github.com/mozilla/pdf.js#learning and search existing issues. If this does not help, please prepare a short well-documented example that demonstrates the problem and make it accessible online on your website, JS Bin, GitHub, etc. before opening a new issue or contacting us in the Matrix room -- keep in mind that just code snippets won't help us troubleshoot the problem.


Image resources within an Annotation will be rendered as part of the (regular) operatorList for the page. Please refer to the beginAnnotations/endAnnotations and beginAnnotation/endAnnotation operators, within which such image resources will be placed (similar to the "regular" ones).
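A minimal sketch of that idea: walk the operator list once and keep only the image-painting operators that fall between a `beginAnnotation` and the matching `endAnnotation`. `findAnnotationImages` is a hypothetical helper, not a pdf.js API. It assumes the annotation operators are actually present in the list, which may depend on the pdf.js version and on the options passed to `getOperatorList`. The numeric codes in the stand-in data are placeholders so the example runs on its own.

```javascript
// Hypothetical helper (not part of pdf.js): collect image-painting operators
// that sit between beginAnnotation and endAnnotation in an operator list.
// `operatorList` has the shape returned by pdfPage.getOperatorList(),
// `OPS` is expected to be pdfjsLib.OPS.
function findAnnotationImages(operatorList, OPS) {
  const imageOps = new Set([
    OPS.paintImageXObject,
    OPS.paintInlineImageXObject,
    OPS.paintImageMaskXObject,
    OPS.paintImageXObjectRepeat,
  ]);
  const found = [];
  let inAnnotation = false;
  let annotationArgs = null; // arguments of the enclosing beginAnnotation

  operatorList.fnArray.forEach((fn, i) => {
    const args = operatorList.argsArray[i];
    if (fn === OPS.beginAnnotation) {
      inAnnotation = true;
      annotationArgs = args;
    } else if (fn === OPS.endAnnotation) {
      inAnnotation = false;
      annotationArgs = null;
    } else if (inAnnotation && imageOps.has(fn)) {
      found.push({ index: i, annotationArgs, imageArgs: args });
    }
  });
  return found;
}

// Stand-in data so the sketch runs without pdf.js; these codes are
// placeholders, not the real values of pdfjsLib.OPS.
const fakeOPS = {
  beginAnnotation: 1,
  endAnnotation: 2,
  paintImageXObject: 3,
  paintInlineImageXObject: 4,
  paintImageMaskXObject: 5,
  paintImageXObjectRepeat: 6,
};
const fakeList = {
  fnArray: [3, 1, 3, 2],
  argsArray: [["img_p0_1", 10, 10], [[0, 0, 100, 50]], ["img_p0_2", 20, 20], null],
};
console.log(findAnnotationImages(fakeList, fakeOPS));
```

With a real page, the call would presumably be `findAnnotationImages(await pdfPage.getOperatorList(), pdfjsLib.OPS)`. An empty result can mean either that the annotation has no image or that no annotation operators were emitted for that list.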

MorganWaya commented 3 years ago

Thank you for your reply.

> Image resources within an Annotation will be rendered as part of the (regular) operatorList for the page, please refer to the beginAnnotations/endAnnotations and beginAnnotation/endAnnotation operators within which such image resources will be placed (similar to the "regular" ones).

I tried to get objects using these operator types (beginAnnotations/endAnnotations and beginAnnotation/endAnnotation), but I only get empty arrays.

This is the PDF file I used, created with Adobe Acrobat DC: genuine.pdf

And my POC trying to get images: pdf-form-image-poc.zip

Best regards.