jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

Handling curved characters and extracting words for curved characters #404

Closed sreeni5493 closed 3 years ago

sreeni5493 commented 3 years ago

Artwork_bavaria.pdf This example contains curved text. The curved characters have the upright flag set to "True". Also, when extracting words with extract_words, the curved text gets split into single characters or half-broken words. Is there any way this could be solved?

samkit-jain commented 3 years ago

Possibly a duplicate of #400

jsvine commented 3 years ago

Very interesting example PDF, thank you for sharing. Outside of the fact that pdfminer.six has a very narrow definition of "upright" — see prior discussion here — I don't see a bug here. But — per that same discussion — I do think it's worth trying to get pdfminer.six to expose each character's matrix property, which would let you identify arbitrarily rotated characters and handle them with your own logic:

Of course, as you note, someone might be interested to know the precise rotation, rather than just the binary upright value. For that, pdfplumber could expose the matrix property, which it currently does not — but I'll add this to my to-do list. From that, you could calculate the rotation (or perhaps pdfplumber could provide a utility function to do the same).
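A minimal sketch of that rotation calculation, assuming each char dict carried a `matrix` entry holding the six-number PDF text matrix `(a, b, c, d, e, f)`. The `matrix` key and both function names are assumptions made for illustration, not an API confirmed in this thread. For a pure rotation (optionally with uniform scaling), `a` and `b` are proportional to the cosine and sine of the angle, so `atan2(b, a)` recovers it; skewed matrices would only be approximated.

```python
import math


def rotation_degrees(matrix):
    """Return the rotation implied by a PDF text matrix, in degrees.

    `matrix` is assumed to be the six-number tuple (a, b, c, d, e, f).
    The angle is measured counter-clockwise in PDF coordinate space.
    """
    a, b = matrix[0], matrix[1]
    return math.degrees(math.atan2(b, a))


def split_by_rotation(chars, tolerance=1.0):
    """Partition char dicts into (unrotated, rotated) lists.

    Hypothetical helper: each char is assumed to have a "matrix" key.
    A char counts as unrotated if its angle is within `tolerance` degrees of 0.
    """
    unrotated, rotated = [], []
    for char in chars:
        angle = rotation_degrees(char["matrix"])
        if abs(angle) <= tolerance:
            unrotated.append(char)
        else:
            rotated.append(char)
    return unrotated, rotated


if __name__ == "__main__":
    sample = [
        {"text": "A", "matrix": (1, 0, 0, 1, 0, 0)},  # no rotation
        {"text": "B", "matrix": (0, 1, -1, 0, 10, 20)},  # quarter turn
    ]
    straight, curved = split_by_rotation(sample)
    print([c["text"] for c in straight], [c["text"] for c in curved])
```

With the characters separated this way, the unrotated ones could be passed through ordinary word extraction, while the rotated ones would be handled by whatever custom grouping suits the document.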

(Handling arbitrarily-rotated characters is, however, I think outside the scope of extract_words(...). I cannot think of a widely-applicable heuristic that one could apply to situations like these.)

Closing for now, but feel free to continue the discussion.