Closed by NewUserHa 1 year ago
Can you please provide the PDF?
_get_fonts is an internal/private function that currently looks like dead code. I personally do not see any usage of this function. Can you clarify your use case?
But it was added just last year.
My use case iterates over all font info to get the /ToUnicode entries
and extracts them to files for later analysis, for example to extract text that is protected by custom fonts.
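To make the use case concrete, here is a minimal sketch of dumping /ToUnicode CMaps to files. The helper name `save_to_unicode` and the `.cmap` file extension are made up for illustration; the sketch assumes `fonts` maps resource names (such as /F1) to font dictionaries and that each /ToUnicode value is a stream exposing `get_data()`, as pypdf stream objects do.

```python
from pathlib import Path
from typing import Any, List, Mapping


def save_to_unicode(fonts: Mapping[str, Any], out_dir: str) -> List[Path]:
    """Write each font's /ToUnicode CMap to out_dir; return the files written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for name, font in fonts.items():
        # Follow an indirect reference if the object offers get_object().
        font = font.get_object() if hasattr(font, "get_object") else font
        cmap = font.get("/ToUnicode")
        if cmap is None:
            continue  # this font carries no ToUnicode CMap
        cmap = cmap.get_object() if hasattr(cmap, "get_object") else cmap
        if not hasattr(cmap, "get_data"):
            continue  # not a stream, nothing to save
        path = target / f"{name.lstrip('/')}.cmap"  # hypothetical naming scheme
        path.write_bytes(cmap.get_data())
        written.append(path)
    return written
```

Resource names are only unique within one resource dictionary, so files from different pages would need distinct output directories or prefixes.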
I've found it: it was introduced with #1083, but the comment in #183 considered it still quite experimental.
For your information, the /Font entries are very local (scoped to a page, actually) and define how to render some text. The same font name may list only a limited number of CID-to-GID mappings, and the ToUnicode CMap defines how those codes should be converted to Unicode. Accumulating this data may not be very useful.
However, if you want to keep this approach, I would personally loop directly through ["/Resources"]["/Font"] to look at the data (embedded or not embedded is of no use in your approach).
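For illustration, looping directly over a page's font resources could look roughly like this. It is only a sketch: `resolve` and `page_fonts` are hypothetical helper names, the page is treated as a dict-like object, and a /Resources entry inherited from a parent page-tree node is not handled.

```python
from typing import Any, Dict, Mapping


def resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def page_fonts(page: Mapping[str, Any]) -> Dict[str, Any]:
    """Return {resource name: font dictionary} for fonts declared on the page itself."""
    resources = resolve(page.get("/Resources", {}))
    fonts = resolve(resources.get("/Font", {}))
    return {name: resolve(font) for name, font in fonts.items()}
```

With pypdf, a page taken from `PdfReader(...).pages` could be passed as `page`, since page objects behave like dictionaries.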
Right. But in pypdf, the fonts of page 1 are located at something like resources/../xf1/.../font, while those of the following pages are located at something like resources/..xf[page n]/.../resources/.../xf[page n-1]/.../font, so it's inconvenient to loop through them manually. That is why I tried this _get_fonts function.
I'm not familiar with PDF, so I don't know where fonts should be located.
But from the PDF I have at hand, it seems that the fonts are usually located in a similar pattern, and a function that recurses through every object (e.g. _get_fonts) could stop at any ./font depth (currently it doesn't).
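A recursive walk of the kind described above, one that follows nested form XObjects and stops as soon as it reaches a /Font dictionary, might be sketched as follows. This is not how _get_fonts is implemented; `walk_fonts` and `_resolve` are hypothetical names, nodes are treated as dict-like objects, and other places where fonts can hide (for example annotation appearance streams) are not covered.

```python
from collections.abc import Mapping
from typing import Any, Iterator, Optional, Set, Tuple


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object offers get_object()."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def walk_fonts(
    node: Any, path: str = "", seen: Optional[Set[int]] = None
) -> Iterator[Tuple[str, Any]]:
    """Yield (path, font dictionary) for a page or form XObject and its nested forms."""
    seen = set() if seen is None else seen
    node = _resolve(node)
    if id(node) in seen:  # identity-based guard against cyclic references
        return
    seen.add(id(node))
    resources = _resolve(node.get("/Resources", {}))
    for name, font in _resolve(resources.get("/Font", {})).items():
        # Stop here: the font dictionary itself is not descended into.
        yield f"{path}/Font{name}", _resolve(font)
    for name, xobj in _resolve(resources.get("/XObject", {})).items():
        xobj = _resolve(xobj)
        # Only form XObjects carry their own /Resources; images are skipped.
        if isinstance(xobj, Mapping) and xobj.get("/Subtype") == "/Form":
            yield from walk_fonts(xobj, f"{path}/XObject{name}", seen)
```

Each yielded path records where the font was found, which keeps fonts with the same resource name on different pages or forms apart.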
The _get_fonts function is useful, and I think it could become a public function.
(pdfminer.six lacks this feature too. There is an internal class member called cached_fonts, but it is even less convenient to use manually.)
This PR should fix the issue.
_get_fonts returns embedded fonts as unembedded,
since there is embedded font info like this:
This PDF is protected (copy and paste is not possible), and its '/ToUnicode' is incorrect and incomplete, although one is present. Therefore this case should be considered embedded. But the code in https://github.com/py-pdf/pypdf/commit/e51141d7ed735703bb07f5ffa7e5d2f4d9a79347
unembedded = fonts - embedded
is not right for this case.
Environment
Windows-10-10.0.17134-SP0 pypdf==3.16.0, crypt_provider=('cryptography', '38.0.4'), PIL=9.4.0