mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0
Unexpected value in getOperatorList().argsArray[] for op 44 (showText) #10939

Closed hughsw closed 5 years ago

hughsw commented 5 years ago

Attach (recommended) or Link to PDF file here: DesignerSpec-SG-009-01.pdf

Configuration:

Steps to reproduce the problem:

  1. Save the PDF file as DesignerSpec-SG-009-01.pdf
  2. Run the following script:
    
    #!/usr/bin/env node

    const pdfjsLib = require('pdfjs-dist');

    const go = async () => {
      const filename = 'DesignerSpec-SG-009-01.pdf';
      console.log('filename:', filename);
      const pdf = await pdfjsLib.getDocument(filename).promise;
      const page = await pdf.getPage(1);
      const operatorList = await page.getOperatorList();

      const index = 11;
      console.log('op:', operatorList.fnArray[index]);
      const [args0] = operatorList.argsArray[index];
      args0.forEach(item => console.log(JSON.stringify(item)));
      return 0;
    };

    go()
      .catch(err => { console.log('err:', err); return 1; })
      .then(code => process.exit(code));


What is the expected behavior?  All items in `args0` will be objects.

What went wrong?  The third item in the list is a number.

    filename: DesignerSpec-SG-009-01.pdf
    op: 44
    {"fontChar":"","unicode":"I","accent":null,"width":267,"isSpace":false,"isInFont":true}
    {"fontChar":"","unicode":"t","accent":null,"width":347,"isSpace":false,"isInFont":true}
    12.9
    {"fontChar":"","unicode":"e","accent":null,"width":503,"isSpace":false,"isInFont":true}
    {"fontChar":"","unicode":"m","accent":null,"width":813,"isSpace":false,"isInFont":true}
    {"fontChar":"%","unicode":" ","accent":null,"width":226,"isSpace":false,"isInFont":false}
    {"fontChar":"","unicode":"N","accent":null,"width":659,"isSpace":false,"isInFont":true}
    {"fontChar":"","unicode":"a","accent":null,"width":494,"isSpace":false,"isInFont":true}
    {"fontChar":"","unicode":"m","accent":null,"width":813,"isSpace":false,"isInFont":true}
    {"fontChar":"","unicode":"e","accent":null,"width":503,"isSpace":false,"isInFont":true}
    {"fontChar":"","unicode":":","accent":null,"width":276,"isSpace":false,"isInFont":true}



Of course I'm digging into internals, but this behavior is very curious, and the code I've examined in `canvas.js` for `showText` doesn't look like it expects this case... Also, it looks like a spurious entry rather than a corrupted one, because the text in the `unicode` fields reads correctly.

This is just the first instance within this 1-page PDF. We receive PDFs with hundreds of pages that show this behavior thousands of times. Note that the `fontChar` vs `unicode` mismatch is what first drew my attention to this file. That mismatch is common in files from this designer, but is otherwise uncommon.

Snuffleupagus commented 5 years ago

> What went wrong? The third item in the list is a number.

Please explain why you consider this to be wrong! Given how the `TJ` operator is defined in the PDF specification (see https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf#G8.1904676 and the example just below it), this is totally correct and thus expected behaviour.

> [...] and the code I've examined in `canvas.js` for `showText` doesn't look like it expects this case...

Yes it does, please see https://github.com/mozilla/pdf.js/blob/5517c94d66f22cc98df6dd1dab90ced15d49f3b8/src/display/canvas.js#L1500-L1504
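
For readers who hit the same surprise: the PDF specification allows a `TJ` array to mix strings with numbers, where each number is a position adjustment expressed in thousandths of a text-space unit. The sketch below shows one way consumer code could tolerate both kinds of entries. It is only an illustration: `summarizeShowTextArgs` and the trimmed `sample` array are hypothetical names made up here, not part of pdf.js, and the sample merely mimics the first few entries of the output shown in the report.

```javascript
// Hypothetical helper (not a pdf.js API): separates the glyph objects from
// the numeric TJ adjustments found in the first argument of a showText entry.
function summarizeShowTextArgs(items) {
  const glyphs = [];
  const adjustments = [];
  for (const item of items) {
    if (typeof item === 'number') {
      // Numeric entry: a spacing adjustment, not a glyph.
      adjustments.push(item);
    } else if (item !== null && typeof item === 'object') {
      // Glyph entry: assumed to carry a `unicode` field, as in the report.
      glyphs.push(item);
    }
  }
  const text = glyphs.map(glyph => glyph.unicode).join('');
  return { text, glyphs, adjustments };
}

// Trimmed-down stand-in for operatorList.argsArray[index][0].
const sample = [
  { unicode: 'I', width: 267 },
  { unicode: 't', width: 347 },
  12.9,
  { unicode: 'e', width: 503 },
  { unicode: 'm', width: 813 },
];

const summary = summarizeShowTextArgs(sample);
console.log('text:', summary.text);
console.log('adjustments:', summary.adjustments);
```

With a real document, the same helper could presumably be fed `args0` from the script in the report, so the text can be read back without tripping over the numeric entries.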

hughsw commented 5 years ago

It was unexpected because I'm not versed in the spec and am just starting to use pdf.js.

Thanks for the quick response and pointers. That's very helpful.

timvandermeij commented 5 years ago

Closing as answered.