kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0
213 stars 68 forks

Missing Words while extracting from PDF #167

Open abhiwins opened 1 month ago

abhiwins commented 1 month ago

A lot of words are missing when text is extracted from the PDF. Scenario: pages with a large amount of text (more than 1000 words).

lfoppiano commented 1 month ago

Hi @abhiwins, could you please provide some examples, including the input PDF and the output? Also, on which OS/platform did you run it?

Thank you

abhiwins commented 1 month ago

Attached are the PDF and an image of the output.

Validated on Ubuntu 20.04 and 24.04.

Test_pdf_word_issue Test_pdf_word_issue.pdf
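
To put a number on the missing words, one option is to count the word tokens per page in the generated ALTO file and compare that with a count from another source. The snippet below is only a minimal sketch: it assumes the output follows the ALTO convention of `String` elements with a `CONTENT` attribute nested under `Page` elements, and the helper name `words_per_page` is hypothetical, not part of pdfalto.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace so the check works across ALTO versions."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def words_per_page(alto_xml):
    """Return the number of non-empty String elements for each Page, in order.

    Assumption: every word is a <String CONTENT="..."/> element under <Page>.
    """
    root = ET.fromstring(alto_xml)
    counts = []
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        n = sum(
            1
            for s in page.iter()
            if local_name(s.tag) == "String" and s.get("CONTENT")
        )
        counts.append(n)
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_words.py output.xml
    with open(sys.argv[1], encoding="utf-8") as fh:
        for i, n in enumerate(words_per_page(fh.read()), start=1):
            print(f"page {i}: {n} words")
```

A page whose count is far below what the PDF visibly contains would be a good candidate to share as a reproduction case.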