pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Conflicting PDF text positions reported by pdf.js vs pdfminer #895

Open jamescrowley opened 1 year ago

jamescrowley commented 1 year ago

I'm trying to round-trip text positions from pdfminer (server side) into pdf.js. Is this something you'd expect to be possible? Using the compressed.tracemonkey-pldi-09.pdf file, the first block of text, "Trace-based Just-in-Time Type Specialization for Dynamic", has matching x0, x1, width and height in both libraries, but y0 and y1 are offset by about 3.75 units (pdf.js reports the higher values).

I am not sure which library is 'correct' (or indeed if both can be correct, and this is subject to interpretation?).

If this isn't reliable across text blocks, are there other ways to safely round-trip references to text segments in the PDF?

pdfjs:

{
    "location": {
        "x0": 80.5159,
        "x1": 529.607,
        "y0": 700.6706,
        "y1": 718.6034,
        "width": 449.0911,
        "height": 17.9328
    },
    "text": "Trace-based Just-in-Time Type Specialization for Dynamic\n"
}

pdfminer:

{
  "location": {
    "x0": 80.5159,
    "x1": 529.607,
    "y0": 696.9226,
    "y1": 714.8554,
    "width": 449.0911,
    "height": 17.9328
  },
  "text": "Trace-based Just-in-Time Type Specialization for Dynamic\n"
}
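
To make the discrepancy concrete, here is a minimal sketch that subtracts the two boxes quoted above field by field. The helper name `box_deltas` is hypothetical and not part of either library. One possible reading of the result, stated here only as an assumption, is that pdf.js's `transform[5]` is the text baseline while pdfminer's `y0` is the bottom of the glyph box including the descender.

```python
# Values copied from the pdfjs and pdfminer JSON snippets in this report.
pdfjs = {"x0": 80.5159, "x1": 529.607, "y0": 700.6706, "y1": 718.6034,
         "width": 449.0911, "height": 17.9328}
pdfminer = {"x0": 80.5159, "x1": 529.607, "y0": 696.9226, "y1": 714.8554,
            "width": 449.0911, "height": 17.9328}


def box_deltas(a, b, ndigits=4):
    """Return a[key] - b[key] for every key, rounded to hide float noise.

    Hypothetical helper for illustration; not an API of pdf.js or pdfminer.
    """
    return {key: round(a[key] - b[key], ndigits) for key in a}


deltas = box_deltas(pdfjs, pdfminer)

# Vertical shift expressed as a fraction of the reported line height.
shift_ratio = deltas["y0"] / pdfjs["height"]

print(deltas)
print(round(shift_ratio, 3))
```

Only y0 and y1 differ, both by about 3.748, which is roughly 0.209 of the reported height. A constant shift proportional to the line height would be consistent with a font descent being included by one library and not the other, but that is an inference from a single text block.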

Attach (recommended) or Link to PDF file here

https://github.com/mozilla/pdf.js/blob/master/web/compressed.tracemonkey-pldi-09.pdf

Steps to reproduce the problem:

  1. pdfjs: (full code here: https://jsfiddle.net/jamescrowley/tzg8sb9w/32/)
  // 'item' is an item in the array from getTextContent
  return {
    "location": {
      "x0": Number(item.transform[4].toFixed(4)),
      "x1": Number((item.transform[4] + item.width).toFixed(4)),
      "y0": Number(item.transform[5].toFixed(4)),
      "y1": Number((item.transform[5] + item.height).toFixed(4)),
      "width": Number(item.width.toFixed(4)),
      "height": Number(item.height.toFixed(4)),
    },
    "text": item.str + (item.hasEOL ? "\n" : "")
  }
  2. pdfminer (full code here: https://colab.research.google.com/drive/1nocq-on3mnOcsYTEvNu8ii8voF7VAP8x#scrollTo=2CqZLqUWXe_N)
  # text_line is an item in the LTTextContainer arrays from extract_pages
  return {
      "location": {
          "x0": round(text_line.x0, 4),
          "x1": round(text_line.x1, 4),
          "y0": round(text_line.y0, 4),
          "y1": round(text_line.y1, 4),
          "width": round(text_line.x1 - text_line.x0, 4),
          "height": round(text_line.y1 - text_line.y0, 4),
      },
      "text": text_line.get_text(),
  }
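
For anyone who wants to run the pdfminer side without opening the notebook, the fragment above can be wrapped roughly as follows. This is a sketch, not the notebook's actual code: the function names `line_to_record` and `iter_line_records` are hypothetical, and the traversal assumes the default layout analysis, where each page yields text boxes that in turn contain `LTTextLine` objects.

```python
from types import SimpleNamespace


def line_to_record(text_line, ndigits=4):
    """Build the location/text dict used in this report for one layout object.

    Reads only x0, y0, x1, y1 and get_text(), as in the original fragment.
    """
    return {
        "location": {
            "x0": round(text_line.x0, ndigits),
            "x1": round(text_line.x1, ndigits),
            "y0": round(text_line.y0, ndigits),
            "y1": round(text_line.y1, ndigits),
            "width": round(text_line.x1 - text_line.x0, ndigits),
            "height": round(text_line.y1 - text_line.y0, ndigits),
        },
        "text": text_line.get_text(),
    }


def iter_line_records(pdf_path, page_numbers=None):
    """Yield one record per text line in the PDF (requires pdfminer.six)."""
    # Imported lazily so line_to_record can be tried without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path, page_numbers=page_numbers):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for child in element:
                if isinstance(child, LTTextLine):
                    yield line_to_record(child)


# Stand-in object carrying the pdfminer values quoted in this report,
# so the record shape can be checked without a PDF file at hand.
example = line_to_record(SimpleNamespace(
    x0=80.5159, x1=529.607, y0=696.9226, y1=714.8554,
    get_text=lambda: "Trace-based Just-in-Time Type Specialization for Dynamic\n",
))
print(example)
```

With pdfminer.six installed, `next(iter_line_records("compressed.tracemonkey-pldi-09.pdf", page_numbers=[0]))` would be the way to fetch the first line of the first page, assuming the file has been downloaded locally under that name.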

What is the expected behavior? (add screenshot)

x0,y0,x1,y1 match across libraries on same text block.

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

  1. pdf.js: https://jsfiddle.net/jamescrowley/tzg8sb9w/32/
  2. pdfminer: https://colab.research.google.com/drive/1nocq-on3mnOcsYTEvNu8ii8voF7VAP8x#scrollTo=2CqZLqUWXe_N

jamescrowley commented 1 year ago

For reference, I resolved this on the pdf.js side: https://github.com/mozilla/pdf.js/issues/16634