Closed sreeni5493 closed 3 years ago
Possibly a duplicate of #400
Very interesting example PDF, thank you for sharing. Outside of the fact that pdfminer.six has a very narrow definition of "upright" (see prior discussion here), I don't see a bug here. But, per that same discussion, I do think it's worth trying to get pdfminer.six to expose each character's matrix property, which would let you identify arbitrarily rotated characters and handle them with your own logic:
Of course, as you note, someone might be interested to know the precise rotation, rather than just the binary upright value. For that, pdfplumber could expose the matrix property, which it currently does not (but I'll add this to my to-do list). From that, you could calculate the rotation (or perhaps pdfplumber could provide a utility function to do the same).
(Handling arbitrarily-rotated characters is, however, I think outside the scope of extract_words(...). I cannot think of a widely-applicable heuristic that one could apply to situations like these.)
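For anyone who wants to try the rotation calculation mentioned above, here is a minimal sketch. It assumes the character's matrix is available as the usual six-number PDF transformation tuple (a, b, c, d, e, f); how you obtain that tuple depends on your library version and is not shown. The names rotation_degrees and is_rotated are hypothetical helpers, not part of pdfplumber or pdfminer.six.

```python
import math


def rotation_degrees(matrix):
    """Return the counter-clockwise rotation, in degrees, implied by a
    PDF transformation matrix given as (a, b, c, d, e, f).

    Assumption: the matrix is a rotation plus uniform scaling, so the
    angle of its x-axis basis vector (a, b) is the character's rotation.
    Skewed or mirrored text would need more careful handling.
    """
    a, b = matrix[0], matrix[1]
    return math.degrees(math.atan2(b, a))


def is_rotated(matrix, tolerance=1.0):
    """Hypothetical helper: True if the rotation exceeds `tolerance` degrees."""
    return abs(rotation_degrees(matrix)) > tolerance


# Illustrative matrix for a character rotated 30 degrees (made-up values,
# not taken from the PDF attached to this issue).
theta = math.radians(30)
curved_char_matrix = (
    math.cos(theta), math.sin(theta),
    -math.sin(theta), math.cos(theta),
    100.0, 200.0,
)
print(round(rotation_degrees(curved_char_matrix), 6))
print(is_rotated(curved_char_matrix))
```

With something like this you could filter out or group characters by angle before handing the rest to extract_words, rather than relying on the binary upright value alone.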
Closing for now, but feel free to continue the discussion.
Artwork_bavaria.pdf
In this example there is curved text. The curved characters have the upright flag set to "True". Also, when extracting words using extract_words, the curved text is split into single characters or half-broken words. Is there any way this could be solved?