jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Font properties for word and characters #368

Closed sreeni5493 closed 3 years ago

sreeni5493 commented 3 years ago

Can we get all the font properties, such as the orientation of the text, whether the font is bold, italic, or underlined, and the font color (stroke color and fill color)?

jsvine commented 3 years ago

For characters

For words

Passing a list of extra_attrs (e.g., ["fontname", "non_stroking_color"]) to page.extract_words(...) (see here) will restrict each word to characters that share exactly the same value for each of those attributes, and the resulting word dicts will indicate those attributes.
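To make the grouping rule concrete, here is a toy sketch in plain Python. It is not pdfplumber's implementation: `split_by_attrs` is a hypothetical helper, the character dicts are made up, and it ignores the position-based word splitting that a real extraction also performs. It only shows how a run of characters breaks wherever one of the listed attributes changes, and how each resulting word carries those attribute values.

```python
from itertools import groupby


def split_by_attrs(chars, extra_attrs):
    """Hypothetical helper: split consecutive chars into runs that share
    the same value for every attribute named in extra_attrs."""
    def key(ch):
        return tuple(ch.get(attr) for attr in extra_attrs)

    words = []
    for values, run in groupby(chars, key=key):
        word = {"text": "".join(ch["text"] for ch in run)}
        # Record the shared attribute values on the word itself.
        word.update(zip(extra_attrs, values))
        words.append(word)
    return words


# Made-up character data: the font changes halfway through.
chars = [
    {"text": "B", "fontname": "Helvetica-Bold"},
    {"text": "o", "fontname": "Helvetica-Bold"},
    {"text": "l", "fontname": "Helvetica"},
    {"text": "d", "fontname": "Helvetica"},
]

words = split_by_attrs(chars, ["fontname"])
for word in words:
    print(word)
```

With real data, the equivalent call would be page.extract_words(extra_attrs=["fontname"]), as described above.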


Closing this issue for now, but feel free to continue the discussion here.

sreeni5493 commented 3 years ago

rotation2.pdf

Hi, please check this PDF. What I want to know is the orientation of each piece of text. For example, the word "Ninety" is rotated 90 degrees anti-clockwise. Where is this information stored? upright just gives True or False. Can we get the orientation or angle of the text (0 degrees, 45 degrees, 90 degrees, etc.)?

mkl-public commented 3 years ago

As an aside, the word "Ninety" is not only rotated 90 degrees anti-clockwise, it's also not drawn as text but as a series of vector graphics instructions. This complicates text extraction.

sreeni5493 commented 3 years ago

So for any text that is not oriented at 0 degrees, we can't get information on its orientation? Or am I understanding this wrong? Is there any example where upright gives an orientation, rather than just True?

sreeni5493 commented 3 years ago

Also, I re-checked. @mkl-public "Ninety" is available as text. You can copy paste and also highlight in PDF. The issue is the text "Arbitrary", which I created in Word as text and rotated by approximately 135 degrees anti-clockwise. I am unable to highlight this text in the PDF. But yes, "Ninety" is available as text.

jsvine commented 3 years ago

@sreeni5493: This is a tricky one, and may relate to the particular way Microsoft Word, which was used to create the PDF, generates rotated text. Let's take the "Ninety" as an example...

pdfminer.six, which pdfplumber uses to extract PDF information, defines upright in this manner, within the LTChar class:

        (a, b, c, d, e, f) = self.matrix
        self.upright = (0 < a*d*scaling and b*c <= 0)

In the PDF you've shared, the "Ninety" is generated this way (with my comments starting with %):

BT  % Begin text object
/F1 11.04 Tf  % Set text font and size
-0.000000044 1 -1 -0.000000044 133.82 468.65 Tm  % Set text matrix and text line matrix
0 g  % Set the non-stroking color 
0 G  % Set the stroking color 
[(N)5(in)5(et)-3(y)-3( degr)15(ee)-3( anti)12(clo)6(ckw)-4(is)12(e)] TJ  % Show text string
ET  % End text object

(For more on these operators, see the official PDF reference.)

As you can see, the first and fourth arguments of the text matrix are just barely not 0. Perceptually, it's nearly impossible to see the difference in the PDF. But, according to the pdfminer.six calculation, upright==1: a*d is a tiny positive number (the product of two negative values), so 0 < a*d*scaling holds for a positive scaling, and b*c is -1, so b*c <= 0 holds too. If you edit the raw, decompressed PDF to set the matrix instead to -0.000000000 1 -1 -0.000000000 133.82 468.65 Tm, then a*d becomes 0 and you will get upright==0 in pdfplumber for that text.
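The effect of those near-zero values can be reproduced without a PDF. The sketch below copies only the quoted expression into a standalone function; `is_upright` is a hypothetical name, and the default `scaling=1.0` is an assumption (in pdfminer.six the value comes from the text state).

```python
def is_upright(matrix, scaling=1.0):
    """Standalone copy of the quoted expression:
    0 < a*d*scaling and b*c <= 0.
    scaling=1.0 is an assumed default for illustration."""
    a, b, c, d, e, f = matrix
    return 0 < a * d * scaling and b * c <= 0


# Text matrix as generated by Word for "Ninety".
word_matrix = (-0.000000044, 1, -1, -0.000000044, 133.82, 468.65)
# Same matrix after editing the first and fourth values to exactly zero.
edited_matrix = (-0.0, 1, -1, -0.0, 133.82, 468.65)
# Identity matrix, i.e. ordinary unrotated text.
unrotated_matrix = (1, 0, 0, 1, 0, 0)

as_generated = is_upright(word_matrix)
after_edit = is_upright(edited_matrix)
unrotated = is_upright(unrotated_matrix)

print(as_generated, after_edit, unrotated)
```

The rotated text and the unrotated text both pass the check, which is why upright alone cannot distinguish them here.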

Of course, as you note, someone might be interested to know the precise rotation, rather than just the binary upright value. For that, pdfplumber could expose the matrix property, which it currently does not — but I'll add this to my to-do list. From that, you could calculate the rotation (or perhaps pdfplumber could provide a utility function to do the same).
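As a rough sketch of what such a calculation could look like (not an existing pdfplumber function; `rotation_degrees` is a hypothetical name), the angle of the text's x-axis can be read from the first two matrix values with atan2. This assumes the matrix contains only rotation, uniform scaling, and translation, with no shear or mirroring, and uses PDF's convention in which positive angles are anti-clockwise.

```python
import math


def rotation_degrees(matrix):
    """Hypothetical helper: angle of the transformed x-axis, in degrees,
    normalized to [0, 360). Assumes no shear or reflection."""
    a, b, c, d, e, f = matrix
    return math.degrees(math.atan2(b, a)) % 360


# Text matrix from the "Ninety" example above.
ninety = rotation_degrees((-0.000000044, 1, -1, -0.000000044, 133.82, 468.65))

# A synthetic 135-degree anti-clockwise rotation, for comparison.
t = math.radians(135)
arbitrary = rotation_degrees((math.cos(t), math.sin(t), -math.sin(t), math.cos(t), 0, 0))

# A 90-degree clockwise rotation comes out as 270.
clockwise = rotation_degrees((0, -1, 1, 0, 0, 0))

print(round(ninety, 3), round(arbitrary, 3), round(clockwise, 3))
```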

sreeni5493 commented 3 years ago

@jsvine There are some libraries, such as TET from PDFlib, which do that.

https://www.pdflib.com/fileadmin/pdflib/pdf/manuals/TET-5.2-manual.pdf

Maybe this could be useful for seeing how they do it. They have an angle, alpha, which does this for them. I'm wondering how they capture it. Their trial version gives this data for every character.

mkl-public commented 3 years ago

@sreeni5493

"Ninety" is available as text. You can copy paste and also highlight in PDF.

I meanwhile found it in the content stream, yes, but Acrobat Reader here still doesn't allow me to select it. Weird.

But sorry for the incorrect aside...

jsvine commented 2 years ago

Update: Just pushed an attempt at making the current transformation matrix (CTM) accessible for characters, and introduced a class (pdfplumber.ctm.CTM) for calculating its scale, skew, and translation: https://github.com/jsvine/pdfplumber/commit/ae6f99e691de72987ab8166403a39776a41d4c30

It turns out that CTMs achieve rotation by combining scale and skew. In most standard cases, it seems, rotation will be equal to the x-axis skew. So, to calculate the rotation, you can run something like this:

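import pdfplumber
from pdfplumber.ctm import CTM

# Assumes the rotation2.pdf shared above is in the working directory
with pdfplumber.open("rotation2.pdf") as pdf:
    my_char = pdf.pages[0].chars[3]
    my_char_ctm = CTM(*my_char["matrix"])  # unpack the six matrix values
    my_char_rotation = my_char_ctm.skew_x  # x-axis skew, used as the rotation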
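For intuition about why rotation shows up as a combination of scale and skew, here is a plain-Python sketch that does not use pdfplumber and is not the CTM class's implementation. `rotation_matrix` and `axis_angles` are hypothetical helpers. The idea: a matrix is a pure rotation (possibly with uniform scale) when its x-axis and y-axis are both turned by the same angle; if the two angles differ, the matrix also contains shear.

```python
import math


def rotation_matrix(angle_degrees, scale=1.0):
    """Hypothetical helper: six-value matrix for an anti-clockwise
    rotation combined with a uniform scale."""
    t = math.radians(angle_degrees)
    return (scale * math.cos(t), scale * math.sin(t),
            -scale * math.sin(t), scale * math.cos(t), 0.0, 0.0)


def axis_angles(matrix):
    """Hypothetical helper: how far (in degrees) the x-axis and the
    y-axis are each turned by the matrix."""
    a, b, c, d, e, f = matrix
    x_axis = math.degrees(math.atan2(b, a))
    y_axis = math.degrees(math.atan2(-c, d))
    return x_axis, y_axis


# Pure rotation: both axes turn by the same 45 degrees.
x_angle, y_angle = axis_angles(rotation_matrix(45, scale=2.0))

# Sheared matrix: the x-axis stays put while the y-axis leans over.
sheared = axis_angles((1, 0, 0.5, 1, 0, 0))

print(x_angle, y_angle, sheared)
```

When the two angles agree, either one can be reported as the rotation, which matches the observation that in standard cases a single skew value is enough.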

To test this out, you can install the unreleased code via pip install -e git+https://github.com/jsvine/pdfplumber.git@feature-ctm

If you get the chance, please let me know whether it suits your particular situation.