Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files

pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

https://pdfminersix.readthedocs.io

MIT License

Text line coordinates/boundingboxes have a wrong constant offset in y-direction in some extracted pdf files #618

Open yeus opened 3 years ago

yeus commented 3 years ago

Bug report

A description of the bug
All text lines are offset in the y-direction by a constant amount.

Here is an example where I plotted the text boxes against the background PDF. As you can see, graphics elements have the "correct" coordinates (the dotted lines match up with the grey PDF graphics), but the text lines have a constant offset.

The example pdf file can be found here: https://files.sma.de/downloads/SBSxx-10-DS-en-30.pdf

I am not sure what's causing this. I suspect it might have something to do with the font. Any help would be appreciated.

yeus commented 3 years ago

So after a little more investigation, I am pretty sure that the error happens somewhere around here:

https://github.com/pdfminer/pdfminer.six/blob/22f90521b823ac5a22785d1439a64c7bdf2c2c6d/pdfminer/layout.py#L306

The position of the character's baseline from the matrix is actually correct. That means that when the bounding box is calculated, bbox_lower_left has to be wrong.

In my case, for example, its y-value is descent + rise, where rise is zero. So one of the two seems to have the wrong value...

I am not sure where rise is coming from, but descent and rise seem to come from two different sources: rise is extracted much earlier, while descent comes directly from the font object. I am not sure if that makes sense; maybe "rise" should actually be the "ascent" from the font object, just like descent...

At least when I calculate descent + ascent, it gives me the right bounding box for the character (in the y-direction), although I haven't checked any more PDFs.
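To make the arithmetic concrete, here is a minimal standalone sketch of the y-extent calculation as described above for horizontal text. It does not import pdfminer.six; the function name, the `scale_y` parameter, and the sample numbers are assumptions for illustration, and the descent is taken to be a fraction of the font size.

```python
def horizontal_char_y_extent(baseline_y, descent_fraction, fontsize,
                             rise=0.0, scale_y=1.0):
    """Return (y0, y1) of a glyph box for unrotated horizontal text.

    Hypothetical helper: it mirrors the 'descent + rise' formula discussed
    in this thread, not the library's actual code path.
    """
    descent = descent_fraction * fontsize      # negative for most fonts
    lower = descent + rise                     # y of bbox_lower_left
    upper = descent + rise + fontsize          # y of bbox_upper_right
    # scale_y stands in for the vertical scale factor of the text matrix.
    return (baseline_y + scale_y * lower, baseline_y + scale_y * upper)


# A plausible descent (made-up value) puts the box around the baseline.
plausible = horizontal_char_y_extent(100.0, -0.25, 10.0)

# An overly large descent (made-up value) pushes the whole box down by a
# constant amount, so it ends at the baseline instead of straddling it.
too_large = horizontal_char_y_extent(100.0, -1.0, 10.0)

print(plausible, too_large)
```

Because rise is zero here, any error in the descent value shifts both y0 and y1 by the same amount, which would match a constant offset for every line set in that font.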

I would appreciate it if someone with more knowledge of the codebase could help me out here.

yeus commented 3 years ago

Just checked with some other PDFs: in all the PDFs that I used, the rise variable is always set to zero, even in those where the bounding boxes come out correct. Using the ascent would break those. It seems the blame is solely on the descent variable and its calculation. Not sure why it is too large for this PDF...

I am not sure how this could happen though; from what I see, the calculation of the descent variable is pretty straightforward.
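One way to check which fonts are affected might be to compare each character's bbox bottom with its baseline and group the result by font. The sketch below is an assumption-laden diagnostic, not part of pdfminer.six: `baseline_offsets` and `FakeChar` are hypothetical names, and the function only relies on objects that have `matrix`, `bbox` and `fontname` attributes.

```python
from collections import defaultdict, namedtuple


def baseline_offsets(chars):
    """Map font name -> set of (bbox y0 - baseline y) / glyph height.

    With rise equal to zero, this ratio should be the descent fraction
    that was used for the font. Rotated or flipped glyphs are skipped.
    """
    per_font = defaultdict(set)
    for ch in chars:
        a, b, c, d, e, f = ch.matrix
        x0, y0, x1, y1 = ch.bbox
        height = y1 - y0
        if b != 0 or c != 0 or d <= 0 or height <= 0:
            continue
        per_font[ch.fontname].add(round((y0 - f) / height, 3))
    return dict(per_font)


# Stand-in objects with made-up values, used here instead of a real PDF.
FakeChar = namedtuple("FakeChar", "matrix bbox fontname")
sample = [
    FakeChar((1, 0, 0, 1, 50, 100), (50, 97.5, 55, 107.5), "FontA"),
    FakeChar((1, 0, 0, 1, 50, 100), (50, 90.0, 55, 100.0), "FontB"),
]
report = baseline_offsets(sample)
print(report)
```

With pdfminer.six, `chars` could be the `LTChar` objects collected from the layout tree returned by `extract_pages`. A font whose ratio is close to -1 rather than somewhere around -0.2 would be a candidate for the oversized descent.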

sreeni5493 commented 3 years ago

Did you figure out any solution?

sreeni5493 commented 3 years ago

I tried removing descent. It worked for some cases. Need to check more cases.
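For reference, "removing descent" can also be approximated after extraction, without patching the library. The sketch below is only an illustration under assumptions: `bbox_without_descent` is a hypothetical helper, it assumes rise is zero (as reported earlier in this thread), and it leaves rotated or flipped glyphs untouched.

```python
def bbox_without_descent(matrix, bbox):
    """Re-anchor a glyph box so that its bottom edge sits on the baseline.

    matrix is the 6-tuple (a, b, c, d, e, f) of the character and bbox is
    (x0, y0, x1, y1). The box height is kept; only the y-position changes.
    """
    a, b, c, d, e, f = matrix
    x0, y0, x1, y1 = bbox
    if b != 0 or c != 0 or d <= 0:
        return bbox  # not plain upright text, leave as-is
    height = y1 - y0
    return (x0, f, x1, f + height)


# Made-up example: a box that ends at the baseline (y = 100) is moved up.
fixed = bbox_without_descent((1, 0, 0, 1, 50, 100), (50, 90, 55, 100))
print(fixed)
```

Note that for fonts with a correct descent this places the box slightly too high, since glyph descenders would then fall below it, which may be why it only works for some cases.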

yeus commented 3 years ago

Did you figure out any solution?

No :( Still no idea what exactly is causing this.