Closed Yaadon closed 1 month ago
Hi @Yaadon, and thank you for providing this example. pdfplumber depends on pdfminer.six for coordinate extraction. Have you compared the coordinates output here to those from pdfminer.six? Are those also off? If so, it would be better to file an issue there. Or does it appear that pdfplumber is introducing the errors?
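For anyone wanting to run that comparison, here is a minimal sketch rather than a definitive diagnostic. It assumes the page's media box starts at (0, 0), so that pdfminer.six's bottom-left-origin boxes can be flipped into the top-based `top`/`bottom` values that pdfplumber reports. The helper names (`to_top_based`, `pdfminer_chars`, `pdfplumber_chars`) are made up for illustration and are not part of either library.

```python
from typing import Iterator, Tuple

Box = Tuple[float, float, float, float]


def to_top_based(bbox: Box, page_height: float) -> Box:
    """Convert (x0, y0, x1, y1) with a bottom-left origin into
    (x0, top, x1, bottom) measured downward from the top of the page."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def pdfminer_chars(path: str, page_index: int = 0) -> Iterator[Tuple[str, Box]]:
    """Yield (text, top-based box) for each character reported by pdfminer.six."""
    # Imported lazily so the pure helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTContainer

    def walk(obj):
        # Characters are nested inside text boxes, lines, and figures.
        if isinstance(obj, LTChar):
            yield obj
        elif isinstance(obj, LTContainer):
            for child in obj:
                yield from walk(child)

    for layout in extract_pages(path, page_numbers=[page_index]):
        for ch in walk(layout):
            # Assumption: the page origin is (0, 0), so layout.height is the flip axis.
            yield ch.get_text(), to_top_based(ch.bbox, layout.height)


def pdfplumber_chars(path: str, page_index: int = 0) -> Iterator[Tuple[str, Box]]:
    """Yield (text, box) for each character reported by pdfplumber."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for ch in pdf.pages[page_index].chars:
            yield ch["text"], (ch["x0"], ch["top"], ch["x1"], ch["bottom"])
```

Printing the first few items from `pdfminer_chars("chinese.pdf")` and `pdfplumber_chars("chinese.pdf")` side by side should show whether the y-offset is already present in the pdfminer.six output. The two libraries may return characters in a different order, so match them by text and x-position rather than by index.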
Have you found the cause?
I've substituted it with pdfium, which works perfectly for me and is much faster, since it is based on C libraries.
Thanks for noting, @Yaadon. Closing this issue for now, since character-bounding-box calculations are outside the scope of pdfplumber (and are instead calculated by pdfminer.six).
Describe the bug
When I tried to extract words from a mixed-language document, the bounding boxes of all the English words were offset, and I am not sure whether this is due to pdfplumber or pdfminer.six.
Have you tried repairing the PDF?
Yes.
Code to reproduce the problem
PDF file
chinese.pdf
Expected behavior
I expect the bbox for English words to be as accurate as the bbox for Chinese characters.
Actual behavior
The bboxes of English words are offset along the y-axis.
Screenshots
Environment
Additional context
I've tried version 0.11.2, but it did not help.