Closed sreeni5493 closed 3 years ago
I checked and found that some options in extra_attrs change how word separation works. Until I find out which one it is, I will close this for now. Apologies for not identifying the bug clearly.
Figured out the issue
For both of these PDFs, adding extra_attrs=["size"] causes the characters to split. The rotation2.pdf document was created in Word with no change in font size. Similarly, the 90-degree characters in 444444 (2).pdf all have the same font size. I am wondering why the "size" attribute is splitting characters into separate words.
words = page.extract_words(use_text_flow=True, extra_attrs=["size"])
With this code, for rotation2.pdf, it returns "N" as a separate word, then "i" as a separate word. When the "size" attribute is used, each character of "Ninety", which is a single word, is returned as a separate word.
For whatever reason the letters in "Ninety" have these sizes:
text    size
------  ------
N       7.132
i       2.539
n       5.796
e       5.498
t       3.698
y       5.001
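A table like the one above can be produced from per-character data. The sketch below is only an illustration: `size_table` and `ninety` are hypothetical names, and the input is assumed to be a list of dicts with "text" and "size" keys, which is the shape of the entries in pdfplumber's page.chars. The values are copied from the table above.

```python
def size_table(chars):
    """Format a two-column text/size table from character dicts."""
    lines = [f"{'text':<6}  {'size':>6}", f"{'-' * 6}  {'-' * 6}"]
    for ch in chars:
        lines.append(f"{ch['text']:<6}  {ch['size']:>6.3f}")
    return "\n".join(lines)


# Sizes reported in this thread for the rotated word "Ninety".
ninety = [
    {"text": "N", "size": 7.132},
    {"text": "i", "size": 2.539},
    {"text": "n", "size": 5.796},
    {"text": "e", "size": 5.498},
    {"text": "t", "size": 3.698},
    {"text": "y", "size": 5.001},
]

print(size_table(ninety))
```

With a real document, passing page.chars (or a slice of it) in place of `ninety` should give the same kind of listing.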
My best guess (given that the raw PDF directives don't appear to encode this difference) is that this comes from how pdfminer.six calculates sizes on rotated text, and that it has something to do with the fact that Word didn't quite make that rotation perfectly 90°. But pdfplumber's behavior is "correct" in this case, given that it's being presented (by pdfminer.six) with characters of non-equal sizes.
If you'd like pdfminer.six to handle that text differently (or ask for clarification on how text of that kind is handled), I'd recommend opening an issue there.
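To see why unequal sizes lead to one word per character, here is a simplified sketch of attribute-based grouping. It is not pdfplumber's actual implementation; `group_by_attrs` is a hypothetical helper that only assumes a word may not span characters whose extra attributes differ. The sizes are the ones reported above for "Ninety".

```python
from itertools import groupby


def group_by_attrs(chars, extra_attrs=()):
    """Join consecutive chars into words, breaking whenever any
    attribute named in extra_attrs changes between characters."""
    def key(ch):
        return tuple(ch[attr] for attr in extra_attrs)

    return ["".join(c["text"] for c in run) for _, run in groupby(chars, key=key)]


# Sizes reported in this thread for the rotated word "Ninety".
ninety_chars = [
    {"text": "N", "size": 7.132},
    {"text": "i", "size": 2.539},
    {"text": "n", "size": 5.796},
    {"text": "e", "size": 5.498},
    {"text": "t", "size": 3.698},
    {"text": "y", "size": 5.001},
]

print(group_by_attrs(ninety_chars))            # no extra attributes compared
print(group_by_attrs(ninety_chars, ["size"]))  # "size" must match within a word
```

Under that assumption, dropping "size" from extra_attrs keeps the word whole, while including it breaks the word at every change in size, which here means at every character.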
Understood, Thanks. I will open it up there.
rotation2.pdf 444444 (2).pdf Both of these files contain text oriented at 90 degrees. Each character is extracted as a separate word when we use page.extract_words(). Is this something that can be fixed by modifying the tolerance? (That doesn't seem to work for me.)