Open dyxohjl666 opened 1 year ago
When using extract_pages, the whole PDF page is detected as a single Figure (LTFigure), which is an incorrect extraction.
The pdf file I used is: N18-3011.pdf
My code:
from pdfminer.high_level import extract_pages

for page_layout in extract_pages("N18-3011.pdf"):
    for element in page_layout:
        print(element)
My output:
<LTTextBoxHorizontal(0) 129.748,26.592,465.527,46.995 'Proceedings of NAACL-HLT 2018, pages 84–91\nNew Orleans, Louisiana, June 1 - 6, 2018. c(cid:13)2017 Association for Computational Linguistics\n'>
<LTTextBoxHorizontal(1) 293.317,53.182,304.226,64.091 '84\n'>
<LTFigure(Fm4) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '85\n'>
<LTFigure(Fm8) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '86\n'>
<LTFigure(Fm11) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '87\n'>
<LTFigure(Fm14) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '88\n'>
<LTFigure(Fm17) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '89\n'>
<LTFigure(Fm20) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '90\n'>
<LTFigure(Fm23) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
<LTTextBoxHorizontal(0) 293.317,53.182,304.226,64.091 '91\n'>
<LTFigure(Fm26) 0.001,0.001,595.276,841.889 matrix=[1.00,0.00,0.00,1.00, (0.00,0.00)]>
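The output above shows one page-sized LTFigure per page, which suggests the body text may be nested inside that figure rather than missing. A possible way to check is sketched below. It has not been verified against N18-3011.pdf. It assumes pdfminer.six, where layout analysis inside figures is off by default and can be enabled with LAParams(all_texts=True). The helper names iter_text and print_page_text are hypothetical and not part of pdfminer. The walker relies on duck typing: anything with get_text() is treated as text, and anything else that is iterable is treated as a container to descend into.

```python
from collections.abc import Iterable
from typing import Any, Iterator


def iter_text(element: Any) -> Iterator[str]:
    """Yield the text of every text-bearing object at or below `element`.

    Hypothetical helper (not a pdfminer API). Objects that expose
    get_text() are yielded as-is and not descended into; other iterable
    objects (for example LTPage or LTFigure) are walked recursively;
    everything else (images, lines, rectangles) is skipped.
    """
    if hasattr(element, "get_text"):
        yield element.get_text()
    elif isinstance(element, Iterable):
        for child in element:
            yield from iter_text(child)


def print_page_text(pdf_path: str) -> None:
    """Print the text of each page, including text nested in figures.

    Assumption: pdfminer.six is installed. all_texts=True asks pdfminer
    to run layout analysis on text inside figures as well.
    """
    # Imported lazily so iter_text can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams

    laparams = LAParams(all_texts=True)
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        print("".join(iter_text(page_layout)))
```

Calling print_page_text("N18-3011.pdf") would then print whatever text pdfminer finds on each page, inside or outside a figure. Without all_texts=True the figure would be expected to hold individual characters only, so the joined text would come out without spaces or line breaks.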