jeanmonet closed this issue 4 years ago
I believe poppler can do that with some of its command-line tools, yes. But it's not part of this Python library. This library is meant to be fast and simple: all it does is extract full pages of text.
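For anyone landing here, a minimal sketch of the command-line route, assuming the `pdftotext` binary from poppler-utils is on your PATH. Its `-bbox` option writes an XHTML document with one `<word>` element per word, carrying `xMin`, `yMin`, `xMax` and `yMax` attributes. The helper names `parse_bbox_xhtml` and `word_boxes` are made up for this example and are not part of this library, and passing `-` to get the output on stdout is how I understand the tool to behave, so check it against your poppler version.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_bbox_xhtml(xhtml):
    """Parse `pdftotext -bbox` output into a list of
    (page_number, text, x_min, y_min, x_max, y_max) tuples."""
    root = ET.fromstring(xhtml)
    words = []
    page_number = 0
    # iter() walks in document order, so each <page> is seen before its words.
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace, if any
        if tag == "page":
            page_number += 1
        elif tag == "word":
            words.append((
                page_number,
                elem.text or "",
                float(elem.get("xMin")),
                float(elem.get("yMin")),
                float(elem.get("xMax")),
                float(elem.get("yMax")),
            ))
    return words


def word_boxes(pdf_path):
    """Run the poppler CLI (assumed to be installed) and parse its output."""
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],  # "-" is assumed to mean stdout
        check=True,
        capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Calling `word_boxes("some.pdf")` (any path you have locally) should give one tuple per word, with coordinates in PDF points as reported by the tool.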
I guess I need to talk to the folks in the Poppler project. I would like to know how it manages to mistake the word "OTHER" for "0THER" (a zero for the letter "O") when it clearly appears as the letter "O". This document is not a scan. A PDF should contain font-drawing instructions, so I don't see how Poppler could confuse the two. I guess I need to learn the internal coding of a PDF (as the folks at Poppler did) so I can see where this problem originates.
@cordeliac please only post comments that are relevant to the issue you are commenting on.
I suspect the answer is no, but wanted to check in case I'm missing something. If pdftotext / poppler is not able to provide text coordinates on the page, do you know of another reliable tool to do so?