pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Question: Negative bbox coordinate (x1) #576

Open songohannyc opened 3 years ago

songohannyc commented 3 years ago

Hello,

Build: pip install pdfminer.six==20201018

We noticed that some bboxes have a negative first coordinate (the left edge, which pdfminer calls x0): the first textbox below starts at -1.536. Does anybody have any insight into what could cause that?

I have included the XML output below. Also, the text of textbox id 11 ("Version 2019.1") is not visible in a PDF viewer/reader; I only see textbox id 12 ("Version 2021.5"). I'm not sure where textbox id 11 is hiding in the document.

We are thinking about dropping the text from any textbox with a negative bbox coordinate. Any insight into what could cause this situation would be really appreciated.

Note: Sorry, we can't attach the PDF document itself because it belongs to our customer.

<textbox id="11" bbox="-1.536,11.781,64.509,22.461">
<textline bbox="-1.536,11.781,64.509,22.461">
<text font="TimesNewRomanPSMT" bbox="-1.536,11.781,6.175,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">V</text>
<text font="TimesNewRomanPSMT" bbox="6.175,11.781,10.917,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">e</text>
<text font="TimesNewRomanPSMT" bbox="10.917,11.781,14.473,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">r</text>
<text font="TimesNewRomanPSMT" bbox="14.537,11.781,18.692,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">s</text>
<text font="TimesNewRomanPSMT" bbox="18.724,11.781,21.693,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">i</text>
<text font="TimesNewRomanPSMT" bbox="21.693,11.781,27.033,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">o</text>
<text font="TimesNewRomanPSMT" bbox="27.033,11.781,32.373,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">n</text>
<text font="TimesNewRomanPSMT" bbox="32.373,11.781,35.043,22.461" colourspace="DeviceGray" ncolour="0" size="10.680"> </text>
<text font="TimesNewRomanPSMT" bbox="34.926,11.781,40.266,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">2</text>
<text font="TimesNewRomanPSMT" bbox="40.319,11.781,45.659,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">0</text>
<text font="TimesNewRomanPSMT" bbox="45.712,11.781,51.052,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">1</text>
<text font="TimesNewRomanPSMT" bbox="51.106,11.781,56.446,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">9</text>
<text font="TimesNewRomanPSMT" bbox="56.499,11.781,59.169,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">.</text>
<text font="TimesNewRomanPSMT" bbox="59.169,11.781,64.509,22.461" colourspace="DeviceGray" ncolour="0" size="10.680">1</text>
<text>
</text>
</textline>
</textbox>
<textbox id="12" bbox="19.488,11.784,93.516,23.784">
<textline bbox="19.488,11.784,93.516,23.784">
<text font="TimesNewRomanPSMT" bbox="19.488,11.784,28.152,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">V</text>
<text font="TimesNewRomanPSMT" bbox="28.152,11.784,33.480,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">e</text>
<text font="TimesNewRomanPSMT" bbox="33.408,11.784,37.404,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">r</text>
<text font="TimesNewRomanPSMT" bbox="37.404,11.784,42.072,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">s</text>
<text font="TimesNewRomanPSMT" bbox="42.072,11.784,45.408,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">i</text>
<text font="TimesNewRomanPSMT" bbox="45.408,11.784,51.408,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">o</text>
<text font="TimesNewRomanPSMT" bbox="51.408,11.784,57.408,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">n</text>
<text font="TimesNewRomanPSMT" bbox="57.408,11.784,60.408,23.784" colourspace="DeviceGray" ncolour="0" size="12.000"> </text>
<text font="TimesNewRomanPSMT" bbox="60.516,11.784,66.516,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">2</text>
<text font="TimesNewRomanPSMT" bbox="66.516,11.784,72.516,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">0</text>
<text font="TimesNewRomanPSMT" bbox="72.516,11.784,78.516,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">2</text>
<text font="TimesNewRomanPSMT" bbox="78.516,11.784,84.516,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">1</text>
<text font="TimesNewRomanPSMT" bbox="84.516,11.784,87.516,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">.</text>
<text font="TimesNewRomanPSMT" bbox="87.516,11.784,93.516,23.784" colourspace="DeviceGray" ncolour="0" size="12.000">5</text>
<text>
</text>
</textline>
</textbox>
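
For triage without the original PDF, the XML output above can be scanned for textboxes that have a negative coordinate. The following is only a minimal sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a trimmed copy of the two textboxes above, wrapped in a `<page>` root element that is an assumption made so the snippet parses on its own, and the helper names `parse_bbox` and `negative_textboxes` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML above: only the two <textbox> elements and their
# bboxes. The <page> wrapper is an assumption added so the snippet is
# well-formed by itself.
SAMPLE = """<page>
<textbox id="11" bbox="-1.536,11.781,64.509,22.461"></textbox>
<textbox id="12" bbox="19.488,11.784,93.516,23.784"></textbox>
</page>"""


def parse_bbox(value):
    """Turn an 'x0,y0,x1,y1' attribute string into a tuple of four floats."""
    x0, y0, x1, y1 = (float(part) for part in value.split(","))
    return x0, y0, x1, y1


def negative_textboxes(xml_text):
    """Return the ids of textboxes with at least one negative coordinate."""
    root = ET.fromstring(xml_text)
    flagged = []
    for box in root.iter("textbox"):
        if min(parse_bbox(box.get("bbox"))) < 0:
            flagged.append(box.get("id"))
    return flagged


print(negative_textboxes(SAMPLE))  # prints ['11']
```

This only reports which boxes are affected; it does not explain why the coordinate is negative.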
ducviet00 commented 2 years ago

Is there any update on this? I ran into the same issue. For now I'm skipping the invalid boxes.
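
A skip like this could be sketched as follows. It is a rough illustration, not the approach either commenter actually used: the helper names `is_inside` and `iter_text_inside_page` and the 0.5 pt tolerance are assumptions, and the check simply compares each text container's `bbox` with the `bbox` of the `LTPage` it belongs to, using `extract_pages` from `pdfminer.high_level`.

```python
def is_inside(bbox, page_bbox, tol=0.5):
    """True if bbox lies within page_bbox, allowing tol points of slack."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_bbox
    return (
        x0 >= px0 - tol
        and y0 >= py0 - tol
        and x1 <= px1 + tol
        and y1 <= py1 + tol
    )


def iter_text_inside_page(pdf_path, tol=0.5):
    """Yield (page_number, bbox, text) for text boxes fully on the page."""
    # Imported lazily so is_inside() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            # Keep only text containers whose bbox fits inside the page bbox.
            if isinstance(element, LTTextContainer) and is_inside(
                element.bbox, page.bbox, tol
            ):
                yield page_number, element.bbox, element.get_text()
```

With the bboxes from the original report and an assumed US Letter page of `(0, 0, 612, 792)` (the real page size is unknown because the PDF was not shared), `is_inside((-1.536, 11.781, 64.509, 22.461), (0, 0, 612, 792))` is false for textbox 11 and the same check is true for textbox 12. Dropping such boxes hides the symptom only; it does not establish why the text sits partly off the page.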

pietermarsman commented 2 years ago

@songohannyc @ducviet00 is one of you able to share (a portion of) the PDF? Without it, this is hard to debug.