I'm trying to round-trip text positions from pdfminer (server side) into pdf.js. Is this something you'd expect to be possible? Using the compressed.tracemonkey-pldi-09.pdf file, the first block of text, "Trace-based Just-in-Time Type Specialization for Dynamic", has the same x0, x1, width and height in both libraries, but y0 and y1 are offset slightly.

I am not sure which library is 'correct' (or whether both can be correct and this is subject to interpretation).

If this isn't reliable across text blocks, are there other ways to safely round-trip references to text segments in the PDF?

pdfjs:

{
    "location": {
        "x0": 80.5159,
        "x1": 529.607,
        "y0": 700.6706,
        "y1": 718.6034,
        "width": 449.0911,
        "height": 17.9328
    },
    "text": "Trace-based Just-in-Time Type Specialization for Dynamic\n"
}

pdfminer:

{
  "location": {
    "x0": 80.5159,
    "x1": 529.607,
    "y0": 696.9226,
    "y1": 714.8554,
    "width": 449.0911,
    "height": 17.9328
  },
  "text": "Trace-based Just-in-Time Type Specialization for Dynamic\n"
}
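
To quantify the mismatch, here is a minimal Python sketch that subtracts the pdfminer box from the pdf.js box field by field. It uses only the two location objects shown above; the helper name `box_deltas` is hypothetical and not part of either library.

```python
# Locations copied from the pdf.js and pdfminer outputs above.
pdfjs_box = {
    "x0": 80.5159, "x1": 529.607,
    "y0": 700.6706, "y1": 718.6034,
    "width": 449.0911, "height": 17.9328,
}
pdfminer_box = {
    "x0": 80.5159, "x1": 529.607,
    "y0": 696.9226, "y1": 714.8554,
    "width": 449.0911, "height": 17.9328,
}


def box_deltas(a, b, ndigits=4):
    """Return a[key] - b[key] for every key in a, rounded to ndigits."""
    return {key: round(a[key] - b[key], ndigits) for key in a}


deltas = box_deltas(pdfjs_box, pdfminer_box)
for key, value in deltas.items():
    print(f"{key}: {value}")
```

On these numbers the x fields, width and height differ by 0, while y0 and y1 both differ by 3.748. The box has the same size in both libraries and is shifted vertically by a constant amount, roughly 21% of the 17.9328 height.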

Attach (recommended) or Link to PDF file here

https://github.com/mozilla/pdf.js/blob/master/web/compressed.tracemonkey-pldi-09.pdf

Configuration:

PDF.js version: 3.8.162
PDFMiner version: 20221105

Steps to reproduce the problem:

pdfjs: (full code here: https://jsfiddle.net/jamescrowley/tzg8sb9w/32/)

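  // 'item' is an item in the array from getTextContent.
  // item.transform is a 6-element matrix [a, b, c, d, e, f];
  // indices 4 and 5 are its x and y translation.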
  return { 
    "location": { 
        "x0": Number(item.transform[4].toFixed(4)),
        "x1": Number((item.transform[4] + item.width).toFixed(4)),
        "y0": Number(item.transform[5].toFixed(4)),
        "y1": Number((item.transform[5] + item.height).toFixed(4)),
        "width": Number(item.width.toFixed(4)),
        "height": Number(item.height.toFixed(4)),
    },
    "text": item.str + (item.hasEOL ? "\n" : "")
  }

pdfminer (full code here: https://colab.research.google.com/drive/1nocq-on3mnOcsYTEvNu8ii8voF7VAP8x#scrollTo=2CqZLqUWXe_N)

# text_line is an item in the LTTextContainer arrays from extract_pages
return ({ 
    "location": { 
        "x0": round(text_line.x0, 4),
        "x1": round(text_line.x1, 4),
        "y0": round(text_line.y0, 4),
        "y1": round(text_line.y1, 4),
        "width": round(text_line.x1 - text_line.x0, 4),
        "height": round(text_line.y1 - text_line.y0, 4),
    },
    "text": text_line.get_text()
})

What is the expected behavior? (add screenshot)

x0, y0, x1 and y1 match across libraries for the same text block.
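
The expectation above can be written as a check with a small tolerance. This is only a sketch: `boxes_match` and the 0.01 tolerance are assumptions made for illustration, and the two boxes are the ones from the outputs earlier in this report.

```python
# Boxes copied from the pdf.js and pdfminer outputs in this report.
pdfjs_box = {"x0": 80.5159, "x1": 529.607, "y0": 700.6706, "y1": 718.6034}
pdfminer_box = {"x0": 80.5159, "x1": 529.607, "y0": 696.9226, "y1": 714.8554}


def boxes_match(a, b, tol=0.01, keys=("x0", "x1", "y0", "y1")):
    """True if every listed coordinate differs by at most tol (hypothetical helper)."""
    return all(abs(a[key] - b[key]) <= tol for key in keys)


# Full comparison, then the horizontal coordinates only.
print(boxes_match(pdfjs_box, pdfminer_box))
print(boxes_match(pdfjs_box, pdfminer_box, keys=("x0", "x1")))
```

With the values reported here, the full comparison fails because of the y coordinates, while the comparison restricted to x0 and x1 passes.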

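What went wrong? (add screenshot)

x0, x1, width and height match, but the y0 and y1 values reported by pdf.js are both 3.748 higher than those reported by pdfminer.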

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

pdfminer / pdfminer.six

Conflicting PDF text positions reported by pdf.js vs pdfminer #895