Offset in text bounding boxes when parsing mixed-language documents

Yaadon commented 1 month ago

Describe the bug

When I extract words from a mixed-language document, the bounding boxes of all the English words are offset. I am not sure whether this is caused by pdfplumber or by pdfminer.six.

Have you tried repairing the PDF?

Yes.

import pdfplumber

file = "test_api/chinese.pdf"
pdfplumber.repair(file, outfile="test_api/chinese_repair.pdf")

Code to reproduce the problem

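import pdfplumber

if __name__ == "__main__":
    file = "test_api/chinese.pdf"
    # Use a context manager so the file handle is closed after rendering.
    with pdfplumber.open(file) as doc:
        print(len(doc.pages))
        for i, page in enumerate(doc.pages, start=1):
            # Render the page and outline each extracted word's bounding box.
            im = page.to_image().outline_words()
            im.draw_rects(page.extract_words())
            im.save(f"test_api/page{i}.png")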

PDF file

chinese.pdf

Expected behavior

I hope the bbox for English words is as accurate as for Chinese characters.

Actual behavior

The bounding boxes of English words are offset along the y-axis.

Screenshots

Environment

pdfplumber version: 0.11.1
Python version: 3.10.14
OS: Linux

Additional context

I've also tried version 0.11.2 with

page.to_image().outline_words()

but it did not help.

jsvine commented 1 month ago

Hi @Yaadon, and thank you for providing this example. pdfplumber depends on pdfminer.six for coordinate extraction. Have you compared the coordinates output here to those produced by pdfminer.six directly? Are those also off? If so, it would be better to file an issue there. Or does it appear that pdfplumber is introducing the errors?
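
One way to run that comparison is sketched below. It is a rough illustration, not an official diagnostic: the helper names (unmatched, plumber_boxes, miner_boxes) are hypothetical, and it assumes that pdfplumber's per-character y0/y1 values use the same bottom-left origin as pdfminer.six's LTChar.bbox, so the two can be compared without conversion. Characters are matched by text and rounded coordinates rather than by order, since the two libraries may emit them in different orders.

```python
from typing import Iterable, List, Tuple

Box = Tuple[str, float, float, float, float]  # (text, x0, y0, x1, y1)


def unmatched(a: Iterable[Box], b: Iterable[Box], ndigits: int = 1) -> List[Box]:
    """Return boxes from `a` with no counterpart in `b` after rounding."""
    def key(box: Box) -> tuple:
        text, *coords = box
        return (text, *(round(c, ndigits) for c in coords))

    seen = {key(box) for box in b}
    return [box for box in a if key(box) not in seen]


def plumber_boxes(path: str, page_index: int = 0) -> List[Box]:
    """Character boxes as reported by pdfplumber (bottom-left origin)."""
    import pdfplumber  # imported lazily so `unmatched` has no dependencies

    with pdfplumber.open(path) as pdf:
        chars = pdf.pages[page_index].chars
        return [(c["text"], c["x0"], c["y0"], c["x1"], c["y1"]) for c in chars]


def miner_boxes(path: str, page_index: int = 0) -> List[Box]:
    """Character boxes as reported by pdfminer.six directly."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTContainer

    def walk(obj):
        # LTChar objects are nested inside text boxes and lines, so recurse.
        if isinstance(obj, LTChar):
            yield (obj.get_text(), *obj.bbox)
        elif isinstance(obj, LTContainer):
            for child in obj:
                yield from walk(child)

    for i, layout in enumerate(extract_pages(path)):
        if i == page_index:
            return list(walk(layout))
    return []
```

Calling unmatched(plumber_boxes(path), miner_boxes(path)) with path set to "test_api/chinese.pdf" lists the characters whose pdfplumber box has no pdfminer.six equivalent. An empty result would suggest that pdfplumber passes the coordinates through unchanged and that any offset originates upstream.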

liuhuan-gl commented 1 month ago

Have you found the cause?

Yaadon commented 1 month ago

I've replaced it with pdfium, which works perfectly for me and is much faster, since it is based on C libraries.

jsvine commented 1 month ago

Thanks for noting, @Yaadon. Closing this issue for now, since character-bounding-box calculations are outside the scope of pdfplumber (and are instead calculated by pdfminer.six).
