pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Figure detected incorrect #818

Open dyxohjl666 opened 1 year ago

dyxohjl666 commented 1 year ago

When using extract_pages, the whole PDF page is detected as a single Figure (LTFigure), which is incorrect.

The pdf file I used is: N18-3011.pdf

My code:

from pdfminer.high_level import extract_pages

# Print every top-level layout element on each page
for page_layout in extract_pages("N18-3011.pdf"):
    for element in page_layout:
        print(element)

My output:

<LTTextBoxHorizontal(0) 129.748,26.592,465.527,46.995 'Proceedings of NAACL-HLT 2018, pages 84–91\nNew Orleans, Louisiana, June 1 - 6, 2018. c(cid:13)2017 Association for Computational Linguistics\n'>
<LTTextBoxHorizontal(1) 293.317,53.182,304.226,64.091 '84\n'>
<LTFigure(Fm4) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '85\n'>
<LTFigure(Fm8) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '86\n'>
<LTFigure(Fm11) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '87\n'>
<LTFigure(Fm14) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '88\n'>
<LTFigure(Fm17) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '89\n'>
<LTFigure(Fm20) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '90\n'>
<LTFigure(Fm23) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '91\n'>
<LTFigure(Fm26) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
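
A possible workaround, as a minimal sketch rather than a confirmed fix: the figure names in the output (Fm4, Fm8, ...) suggest that the page content is wrapped in a form object, and by default pdfminer.six does not run layout analysis on text inside figures. Passing LAParams(all_texts=True) asks it to do so, and the text then has to be collected by descending into each LTFigure. The helper names iter_text and extract_text_including_figures are hypothetical, and whether this recovers the text of N18-3011.pdf has not been verified here.

```python
from typing import Any, Iterator


def iter_text(element: Any) -> Iterator[str]:
    """Yield text depth-first, descending into containers such as LTFigure."""
    get_text = getattr(element, "get_text", None)
    if callable(get_text):
        # Text boxes, text lines and single characters all expose get_text().
        yield get_text()
        return
    try:
        children = iter(element)
    except TypeError:
        # Leaf objects without text (lines, rectangles, images) are skipped.
        return
    for child in children:
        yield from iter_text(child)


def extract_text_including_figures(path: str) -> str:
    """Hypothetical helper: collect page text, including text inside figures."""
    # Imported here so iter_text() stays usable without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    # all_texts=True asks pdfminer to run layout analysis inside figures too.
    laparams = LAParams(all_texts=True)
    parts = []
    for page_layout in extract_pages(path, laparams=laparams):
        parts.extend(iter_text(page_layout))
    return "".join(parts)
```

Usage would be print(extract_text_including_figures("N18-3011.pdf")). The walker is duck-typed on purpose: anything with get_text() is treated as text, anything else that is iterable is treated as a container, so it also works when a figure only holds individual characters.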

xsank commented 2 months ago

mark