jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Fonts with custom encoding #85

Open maxpowel opened 3 months ago

maxpowel commented 3 months ago

Hello, nice library. It is very useful and I had no issues until I found a weird PDF. I don't know if it's an edge case or something common because I'm not a PDF expert.

Running pdffonts on the file shows this:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
DRRVLN+AdvPSTim-I                    Type 1C           Custom           yes yes no      12  0
RHJJFZ+AdvSPSMI                      Type 1C           Custom           yes yes no      14  0
FFLANR+AdvPSTim-B                    Type 1C           Custom           yes yes no      16  0
QVMPSS+AdvP4C4E74                    Type 1C           Custom           yes yes no      18  0
JMUKNO+AdvP4C4E51                    Type 1C           Custom           yes yes no      20  0
PTWKHW+AdvPSSym                      Type 1C           Custom           yes yes no      22  0
APZBCX+AdvPSTim                      Type 1C           Custom           yes yes no       8  0
DYRDCR+AdvP4C4E59                    Type 1C           Custom           yes yes no      10  0

If you open the PDF with a reader, the text is rendered properly. With some readers even copy & paste works, but others just copy strange characters.

As far as I know (from reading and searching around), the custom encoding implies non-standard glyph names, and this is why some readers copy garbage: they are copying the raw bytes, which are not readable outside the context of the PDF. But this is just my guess.
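To illustrate the guess above: in a PDF simple font, a "custom" encoding is commonly expressed as a `/Differences` array that maps byte codes to glyph names. The sketch below assumes that layout. The `DiffItem` type, the `apply_differences` function, and the codes 36, 37 and 40 are all hypothetical and are not taken from pdf-extract or from the attached file.

```rust
use std::collections::HashMap;

/// One entry of a PDF `/Differences` array (hypothetical type for this sketch):
/// either a starting character code or a glyph name.
#[derive(Debug, Clone, PartialEq)]
enum DiffItem {
    Code(u8),
    Name(String),
}

/// Build a code -> glyph-name table from a `/Differences` array.
/// An integer sets the current code; each name that follows is assigned
/// to the current code, which then advances by one.
fn apply_differences(items: &[DiffItem]) -> HashMap<u8, String> {
    let mut table = HashMap::new();
    let mut code: Option<u8> = None;
    for item in items {
        match item {
            DiffItem::Code(c) => code = Some(*c),
            DiffItem::Name(n) => {
                // Names that appear before any code are ignored.
                if let Some(c) = code {
                    table.insert(c, n.clone());
                    code = c.checked_add(1);
                }
            }
        }
    }
    table
}

fn main() {
    // Hypothetical codes; only the glyph names resemble the ones in this issue.
    let diffs = vec![
        DiffItem::Code(36),
        DiffItem::Name("C68".to_string()),
        DiffItem::Name("C101".to_string()),
        DiffItem::Code(40),
        DiffItem::Name("C116".to_string()),
    ];
    let table = apply_differences(&diffs);
    for code in [36u8, 37, 40] {
        println!("{} -> {:?}", code, table.get(&code));
    }
}
```

A reader that stops at this table only has the byte codes and names such as `C68`. Unless it can turn those names into Unicode, copying text yields the raw bytes.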

I think this is the same case as https://community.adobe.com/t5/acrobat-discussions/strange-font-encoding-in-pdf-files/td-p/12472215. It seems to affect mainly old files (mine is 20+ years old). Other people are hitting the same issue: https://github.com/kermitt2/grobid/issues/518

When using pdf-extract this is the output I get:

unknown glyph name 'C68' for font APZBCX+AdvPSTim
unknown glyph name 'C101' for font APZBCX+AdvPSTim
unknown glyph name 'C116' for font APZBCX+AdvPSTim
unknown glyph name 'C114' for font APZBCX+AdvPSTim
unknown glyph name 'C109' for font APZBCX+AdvPSTim
unknown glyph name 'C105' for font APZBCX+AdvPSTim
unknown glyph name 'C110' for font APZBCX+AdvPSTim
unknown glyph name 'C97' for font APZBCX+AdvPSTim
unknown glyph name 'C111' for font APZBCX+AdvPSTim
unknown glyph name 'C102' for font APZBCX+AdvPSTim
unknown glyph name 'C100' for font APZBCX+AdvPSTim
unknown glyph name 'C115' for font APZBCX+AdvPSTim
unknown glyph name 'C108' for font APZBCX+AdvPSTim

There are many more lines like these. The returned text is just bytes in some encoding, and it is not readable.

This is a sample:

    -$  #  #     . 
   & .  '        /
 0120   3   4 5    &   '       
 (($1(0  ) 4     0  
) $  /  6 & /  6  
 '    / 7     8 9   : ;  
 4    /$<=    '       
  .      #'   '  : ; 9 4  5  
 &              #
  .   $(() !#   > ;  #
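One observation about the warnings above: the glyph names look like `C` followed by a decimal number, and for the names listed those numbers fall in the printable ASCII range. A minimal sketch of a fallback heuristic follows. It assumes the number is the character's code point, which is only a guess and is unlikely to hold for symbol fonts such as AdvPSSym. The function name `guess_char_from_glyph_name` is hypothetical and is not part of pdf-extract.

```rust
/// Try to interpret a glyph name of the form `C<decimal>` as a character.
/// ASSUMPTION: the decimal number is the character's code point. This is a
/// heuristic guess, not something the PDF or CFF specifications guarantee.
fn guess_char_from_glyph_name(name: &str) -> Option<char> {
    let digits = name.strip_prefix('C')?;
    // Require at least one digit and nothing but digits after the prefix.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let code: u32 = digits.parse().ok()?;
    // Reject invalid code points and control characters.
    char::from_u32(code).filter(|c| !c.is_control())
}

fn main() {
    // Glyph names taken from the warnings in this issue, in order.
    let names = [
        "C68", "C101", "C116", "C114", "C109", "C105", "C110",
        "C97", "C111", "C102", "C100", "C115", "C108",
    ];
    let decoded: String = names
        .iter()
        .filter_map(|n| guess_char_from_glyph_name(n))
        .collect();
    println!("{}", decoded); // prints "Detrminaofdsl"
}
```

Each name appears once in the warnings, so the output is the set of distinct letters in order of first use rather than the page text. It does suggest that the first font is a plain Latin text font under this assumption.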

I cannot share the whole document, but I'm attaching the first page so you can reproduce the error: page.pdf

I will investigate more and if I find anything useful I will put it here.

Thank you

jrmuizel commented 2 months ago

Which readers does copy and paste work in?

maxpowel commented 2 months ago

Okular https://okular.kde.org/ https://github.com/KDE/okular

This is the default PDF reader in KDE. With the file I provided, Google Chrome only copies garbage, but Okular copies the actual content.

jrmuizel commented 2 months ago

I think to fix this we need to parse the CFF fonts.

maxpowel commented 1 month ago

My knowledge of fonts is very limited, but if I can help with anything please tell me. I found this library that handles CFF, https://github.com/RazrFalcon/ttf-parser, but I don't know whether it would be useful for this case.

Thanks