py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

`PageObject._get_fonts()` returns embedded as unembedded. #2192

Closed NewUserHa closed 1 year ago

NewUserHa commented 1 year ago
import pypdf
reader = pypdf.PdfReader(r'...')
reader.pages[3]._get_fonts()

returns embedded fonts as unembedded.

The page contains embedded font info like this:

{'/BaseFont': '/O9-PK748464-Identity-H',
 '/DescendantFonts': # [IndirectObject(448, 0, 2596729619856)]
                     {'/BaseFont': '/O9-PK748464',
                     '/CIDSystemInfo': {'/Ordering': 'PKUO1',
                                       '/Registry': 'Founder',
                                       '/Supplement': 0},
                     '/DW': 480,
                     '/FontDescriptor': # IndirectObject(914, 0, 2596729619856)
                                        {'/Ascent': 709,
                                        '/CapHeight': 674,
                                        '/Descent': -241,
                                        '/Flags': 32,
                                        '/FontBBox': [-115.218, -115.218, 345.65499999999997, 345.65499999999997],
                                        '/FontFile3': {'/Subtype': '/CIDFontType0C', '/Filter': ['/FlateDecode']},
                                        '/FontName': '/O9-PK748464',
                                        '/ItalicAngle': 0,
                                        '/StemV': 91,
                                        '/Type': '/FontDescriptor'},
                     '/Subtype': '/CIDFontType0',
                     '/Type': '/Font'},
 '/Encoding': '/Identity-H',
 '/Subtype': '/Type0',
 '/ToUnicode': IndirectObject(449, 0, 2596729619856),
 '/Type': '/Font'}

This PDF is protected (text cannot be copied and pasted), and although it has a '/ToUnicode' entry, that entry is incorrect and incomplete. The font carries a '/FontFile3' in the descriptor of its descendant font, so it should be considered embedded. But the code from https://github.com/py-pdf/pypdf/commit/e51141d7ed735703bb07f5ffa7e5d2f4d9a79347, `unembedded = fonts - embedded`, is not right for this case.
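
The check being asked for can be sketched on plain dictionaries. This is a minimal illustration, not pypdf's implementation: `has_font_file` and `is_embedded` are hypothetical helper names, indirect references are assumed to be resolved already, and a font is assumed to count as embedded when its own /FontDescriptor, or that of one of its /DescendantFonts, carries /FontFile, /FontFile2, or /FontFile3.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def has_font_file(font):
    """Return True if the font's /FontDescriptor carries an embedded font program."""
    descriptor = font.get("/FontDescriptor") or {}
    return any(key in descriptor for key in FONT_FILE_KEYS)


def is_embedded(font):
    """Hypothetical check: look at the font itself, then at its descendants."""
    if has_font_file(font):
        return True
    # Composite (Type0) fonts keep the descriptor on the descendant CIDFont.
    return any(has_font_file(d) for d in font.get("/DescendantFonts", []))


# Trimmed-down version of the font dictionary from the report.
type0_font = {
    "/BaseFont": "/O9-PK748464-Identity-H",
    "/Subtype": "/Type0",
    "/DescendantFonts": [
        {
            "/BaseFont": "/O9-PK748464",
            "/Subtype": "/CIDFontType0",
            "/FontDescriptor": {
                "/FontName": "/O9-PK748464",
                "/FontFile3": {"/Subtype": "/CIDFontType0C"},
            },
        }
    ],
}

print(is_embedded(type0_font))
```

A check that only inspects the top-level Type0 dictionary would miss the /FontFile3 entry here, because it sits one level down on the descendant font.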

Environment

Windows-10-10.0.17134-SP0 pypdf==3.16.0, crypt_provider=('cryptography', '38.0.4'), PIL=9.4.0

pubpub-zz commented 1 year ago

Can you please provide the PDF?

NewUserHa commented 1 year ago

WS_T 483.8-2016.pdf

pubpub-zz commented 1 year ago

_get_fonts is an internal/private function that currently looks like dead code. I personally do not see any use for this function. Can you clarify your use case?

NewUserHa commented 1 year ago

But it was added just last year.

My use case iterates over all font info to get each /ToUnicode entry and saves it to a file for later analysis, for example to extract text that is protected by custom fonts.
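
That use case could look roughly like the sketch below. It is only an illustration under assumptions: `save_to_unicode_cmaps` is a hypothetical helper, `fonts` is assumed to be a mapping from resource name to font dictionary, and each /ToUnicode value is assumed to be a stream object exposing `get_data()`, as pypdf stream objects do.

```python
from pathlib import Path


def save_to_unicode_cmaps(fonts, out_dir):
    """Write each font's /ToUnicode stream to out_dir; return the paths written.

    `fonts` maps a resource name (e.g. "/F1") to a font dictionary.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, font in fonts.items():
        if "/ToUnicode" not in font:
            continue  # nothing to export for this font
        cmap = font["/ToUnicode"]
        # The value may be an indirect reference; resolve it if it can be.
        if hasattr(cmap, "get_object"):
            cmap = cmap.get_object()
        target = out_dir / (name.lstrip("/") + ".cmap")
        target.write_bytes(cmap.get_data())  # decoded stream bytes
        written.append(target)
    return written
```

With pypdf, `fonts` might be something like `reader.pages[3]["/Resources"]["/Font"]`, assuming the page keeps its fonts directly under its own resources, which is not the case for every page of the PDF discussed here.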

pubpub-zz commented 1 year ago

I've found it: it was introduced with #1083, but the comment in #183 considered it still quite experimental.

For your information, the /Font entries are very local (scoped to a page, actually): they define how to render some text. A given font name may list only a limited number of CID-to-GID mappings, and the ToUnicode CMap, which defines how to "convert" codes to Unicode, should correspond to that subset. Accumulating these data may not be very useful.

However, if you want to keep this approach, I would personally loop directly through ["/Resources"]["/Font"] to look at the data (embedded/not embedded is of no use in your approach).

NewUserHa commented 1 year ago

However, if you want to keep this approach, I would personally loop directly through ["/Resources"]["/Font"] to look at the data (embedded/not embedded is of no use in your approach).

Right. But as pypdf shows it, the fonts of page 1 are located at something like resources/../xf1/.../font, while those of the following pages are at something like resources/..xf[page n]/.../resources/.../xf[page n-1]/.../font. Looping through that manually is inconvenient, so I tried this _get_fonts function.

I'm not familiar with PDF, so I don't know where fonts are supposed to be located. But from the PDF I have at hand, it seems that the fonts usually follow a similar pattern, and a function that recurses through every object (e.g. _get_fonts) could stop at a ./font entry at any depth (currently it doesn't).
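
A recursive walk of that kind might be sketched as follows. This is an assumption-laden illustration rather than what _get_fonts does: `resolve` and `iter_fonts` are hypothetical names, the nested "xf" entries are assumed to be form XObjects carrying their own /Resources, and objects are assumed to be dict-like, with indirect references exposing `get_object()`.

```python
def resolve(obj):
    """Follow an indirect reference if the object exposes get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_fonts(resources, _seen=None):
    """Yield (name, font_dict) for /Font entries, recursing into form XObjects."""
    seen = set() if _seen is None else _seen
    resources = resolve(resources)
    if id(resources) in seen:
        return  # guard against resource dictionaries that reference each other
    seen.add(id(resources))
    for name, font in resolve(resources.get("/Font", {})).items():
        yield name, resolve(font)
    for xobj in resolve(resources.get("/XObject", {})).values():
        xobj = resolve(xobj)
        if xobj.get("/Subtype") == "/Form" and "/Resources" in xobj:
            yield from iter_fonts(xobj["/Resources"], seen)
```

Called as `dict(iter_fonts(page["/Resources"]))`, this would collect fonts from the page itself and from any depth of nested form XObjects. Fonts that reuse the same resource name at different depths would overwrite each other in the resulting dict, so keeping the generator output as a list may be safer.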

The _get_fonts function is useful, and I think it could be exposed as a public function.

(pdfminer.six lacks this feature too. One of its classes has an internal member called cached_fonts, but using it manually is even more inconvenient.)

pubpub-zz commented 1 year ago

This PR should fix the issue.