jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Word extraction for non 0 degree characters is extracting characters and not combining characters to word when "Size" parameter is used #392

Closed sreeni5493 closed 3 years ago

sreeni5493 commented 3 years ago

rotation2.pdf 444444 (2).pdf Both of these files contain text rotated by 90 degrees. page.extract_words() returns each of those characters as a separate word. Can this be fixed by adjusting the tolerance settings? (That doesn't seem to work for me.)

sreeni5493 commented 3 years ago

I checked and found that some options in extra_attrs change how words are separated. Until I work out which one, I will close this for now. Apologies for not identifying the bug clearly.

sreeni5493 commented 3 years ago

Figured out the issue

For both of these PDFs, adding extra_attrs=["size"] causes the characters to be split. The rotation2.pdf document was created in Word with no change in font size. Similarly, the 90-degree characters in 444444 (2).pdf all have the same font size. I am wondering why the "size" attribute splits the characters into separate words.

words = page.extract_words(use_text_flow=True, extra_attrs=["size"])

With this code, for rotation2.pdf, "N" is returned as a separate word, then "i" as a separate word, and so on: every character of the single word "Ninety" is returned as its own word when the "size" attribute is used.
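One way to check whether the characters really share a font size is to look at the "size" value on each character dict. The sketch below is only an illustration: distinct_sizes is a hypothetical helper, and sample_chars is made-up stand-in data rather than output from either PDF. With pdfplumber installed, the list in page.chars could be passed in instead.

```python
def distinct_sizes(chars, ndigits=3):
    """Return the sorted set of rounded font sizes found in a list of char dicts."""
    return sorted({round(c["size"], ndigits) for c in chars})


# Made-up stand-in for page.chars (not data from the attached PDFs).
# With pdfplumber installed, the real list could be obtained with:
#   import pdfplumber
#   with pdfplumber.open("rotation2.pdf") as pdf:
#       chars = pdf.pages[0].chars
sample_chars = [
    {"text": "a", "size": 10.0},
    {"text": "b", "size": 10.0},
    {"text": "c", "size": 12.0},
]

sizes = distinct_sizes(sample_chars)
print(sizes)  # more than one value means the sizes are not uniform
```

If more than one size comes back for text that looks uniform on the page, the reported sizes, rather than the tolerance settings, are the likely reason the word is being split.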

jsvine commented 3 years ago

For whatever reason the letters in "Ninety" have these sizes:

text      size
------  ------
N        7.132
i        2.539
n        5.796
e        5.498
t        3.698
y        5.001

My best guess (given that the raw PDF directives don't appear to encode this difference) is that this comes from how pdfminer.six calculates sizes on rotated text, and that it has something to do with the fact that Word didn't make that rotation exactly 90°. But pdfplumber's behavior is "correct" in this case, given that it's being presented (by pdfminer.six) with characters of non-equal sizes.

If you'd like pdfminer.six to handle that text differently (or ask for clarification on how text of that kind is handled), I'd recommend opening an issue there.
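The splitting discussed in this thread can be illustrated with a simplified model: if consecutive characters are only joined when they share the same value for an extra attribute, then six different sizes give six one-letter words. The sketch below is an assumption-laden toy, not pdfplumber's actual implementation. group_by_attr is a hypothetical helper, and the sizes are the ones reported above for "Ninety".

```python
from itertools import groupby

# Sizes reported in this thread for the letters of "Ninety".
chars = [
    {"text": "N", "size": 7.132},
    {"text": "i", "size": 2.539},
    {"text": "n", "size": 5.796},
    {"text": "e", "size": 5.498},
    {"text": "t", "size": 3.698},
    {"text": "y", "size": 5.001},
]


def group_by_attr(chars, attr):
    """Join runs of consecutive chars that share the same value of `attr`."""
    return [
        "".join(c["text"] for c in run)
        for _, run in groupby(chars, key=lambda c: c[attr])
    ]


# Every letter has a different size, so each one ends up in its own group.
words_split = group_by_attr(chars, "size")

# Hypothetical comparison: the same letters with one uniform size stay together.
uniform = [dict(c, size=5.0) for c in chars]
words_joined = group_by_attr(uniform, "size")

print(words_split)
print(words_joined)
```

Under this model the split is a direct consequence of the unequal sizes, which is why the sizes themselves are the thing to raise with pdfminer.six.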

sreeni5493 commented 3 years ago

Understood, thanks. I will open an issue there.