jalan / pdftotext

Simple PDF text extraction
MIT License
870 stars 99 forks

double column pdf #114

Closed malinphy closed 1 year ago

malinphy commented 1 year ago

I find pdftotext very useful, especially because the layout output clearly reflects the actual layout of the page. However, academic PDFs often have two columns. When extracting text from a double-column PDF, pdftotext does not follow the order of the columns; instead it outputs each horizontal line across both columns. Is there any way to make pdftotext read a double-column PDF in column order? Thanks in advance.

Best Regards

jalan commented 1 year ago

Did you try with all the available layout options? The default should attempt to output the text in reading order. For example, this test PDF:

https://github.com/jalan/pdftotext/blob/master/tests/three_columns.pdf

results in this text output, which seems to be what you want:

column 1
one

column 2
two

column 3
three

Of course, if the software that created the PDF provides wrong information about the intended reading order, all bets are off.
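
For anyone comparing the layout options mentioned above, here is a minimal sketch. It assumes a pdftotext version whose `pdftotext.PDF` constructor accepts the `raw` and `physical` keyword arguments. The names `LAYOUT_MODES` and `extract_with_all_layouts` are made up for this example and are not part of the library.

```python
# Keyword arguments for each layout mode of pdftotext.PDF.
# The mode names on the left are labels invented for this sketch.
LAYOUT_MODES = {
    "default": {},                    # attempts reading order
    "raw": {"raw": True},             # content stream order
    "physical": {"physical": True},   # order on the page, ignoring columns
}


def extract_with_all_layouts(path):
    """Return {mode_name: full_text} for the PDF at `path`, one entry per mode."""
    import pdftotext  # third-party package: pip install pdftotext

    results = {}
    for name, kwargs in LAYOUT_MODES.items():
        with open(path, "rb") as f:
            pdf = pdftotext.PDF(f, **kwargs)
            # A PDF object is a sequence of page strings; join them into one text.
            results[name] = "\n\n".join(pdf)
    return results
```

Calling `extract_with_all_layouts("paper.pdf")` (the file name is a placeholder) and printing each entry makes it easy to see which mode, if any, keeps the columns in order. If the output runs across both columns line by line, it may be worth checking whether `physical=True` is set, since that mode keeps text in page order regardless of columns.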

jalan commented 1 year ago

No response, no example provided.