Is it possible to find text coordinates on the page using pdftotext?

jalan / pdftotext

Simple PDF text extraction

MIT License


Issue #75

Closed jeanmonet closed 4 years ago

jeanmonet commented 4 years ago

I suspect the answer is no, but wanted to check in case I'm missing something. If pdftotext / poppler is not able to provide text coordinates on the page, do you know of another reliable tool to do so?

jalan commented 4 years ago

I believe poppler can do that with some of its command-line tools, yes. But it's not part of this Python library. This library is meant to be fast and simple: all it does is extract full pages of text.
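For readers landing here, a minimal sketch of the command-line route mentioned above. It assumes poppler's `pdftotext` binary is installed and on `PATH`, and relies on its `-bbox` option, which emits XHTML containing one `<word>` element per word with `xMin`, `yMin`, `xMax` and `yMax` attributes. The function names `parse_bbox_xhtml` and `words_with_coordinates` are made up for this illustration and are not part of this Python library.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_bbox_xhtml(xhtml):
    """Parse `pdftotext -bbox` output into
    (page_number, text, x_min, y_min, x_max, y_max) tuples."""
    root = ET.fromstring(xhtml)
    words = []
    page_number = 0
    # iter() walks in document order, so each <page> is seen before its words.
    for element in root.iter():
        # Drop the XHTML namespace prefix, e.g. "{http://...}word" -> "word".
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "page":
            page_number += 1
        elif tag == "word":
            words.append((
                page_number,
                element.text or "",
                float(element.get("xMin")),
                float(element.get("yMin")),
                float(element.get("xMax")),
                float(element.get("yMax")),
            ))
    return words


def words_with_coordinates(pdf_path):
    """Run poppler's pdftotext CLI (assumed to be installed) on one PDF."""
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],  # "-" sends output to stdout
        check=True,
        capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Calling `words_with_coordinates("example.pdf")` (a placeholder path) should give one tuple per word, with coordinates in PDF points. Keeping the parsing separate from the subprocess call means the parser can be checked against a saved XHTML string without poppler installed.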

cordeliac commented 4 years ago

I guess I need to talk to the folks in the poppler project. I would like to know how it manages to mistake the word "OTHER" for "0THER" (a zero for the letter "O") when it clearly appears as the letter "O". This document is not a scan. A PDF should contain font drawing instructions, so I don't see how poppler could confuse the two. I guess I need to learn the internal coding of a PDF (as the folks at poppler did) so I can see where this problem originates.

jalan commented 4 years ago

@cordeliac please only post comments that are relevant to the issue you are commenting on.

jeanmonet commented 4 years ago

Thanks for the confirmation! I appreciate the reliability of this tool for text extraction. I will see what I can do with pdfminer for more advanced usage, although in the past I found it less reliable for accurate text extraction.