Open maxpowel opened 3 months ago
Which readers does copy and paste work in?
Okular https://okular.kde.org/ https://github.com/KDE/okular
This is the default PDF reader in KDE. With the file I provided, Google Chrome only copies garbage, but Okular copies the actual content.
I think to fix this we need to parse the CFF fonts.
My knowledge of fonts is very limited, but if I can help with anything please tell me. I found this library that does CFF stuff, https://github.com/RazrFalcon/ttf-parser, but I don't know if it would be useful for this case.
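To give an idea of what "parsing the CFF fonts" starts with, here is a rough sketch of reading the header and the Name INDEX of a raw CFF font program, following the layout in Adobe Technical Note #5176. The function names (`header_size`, `parse_index`, `font_names`) are made up for illustration and are not part of pdf-extract or ttf-parser. It assumes the CFF bytes have already been pulled out of the PDF font stream.

```rust
/// Checks the 4-byte CFF header (major, minor, hdrSize, offSize) and
/// returns hdrSize, i.e. the offset at which the Name INDEX starts.
fn header_size(data: &[u8]) -> Option<usize> {
    let h = data.get(0..4)?;
    if h[0] != 1 {
        return None; // this sketch only handles CFF version 1
    }
    Some(h[2] as usize)
}

/// Reads a CFF INDEX starting at `start`. Returns its entries and the
/// offset of the first byte after the INDEX.
fn parse_index(data: &[u8], start: usize) -> Option<(Vec<&[u8]>, usize)> {
    let count = u16::from_be_bytes([*data.get(start)?, *data.get(start + 1)?]) as usize;
    if count == 0 {
        // An empty INDEX consists of the two-byte count only.
        return Some((Vec::new(), start + 2));
    }
    let off_size = *data.get(start + 2)? as usize;
    if !(1..=4).contains(&off_size) {
        return None;
    }
    let offsets_start = start + 3;
    // There are count + 1 big-endian offsets, each `off_size` bytes wide.
    let read_offset = |i: usize| -> Option<usize> {
        let pos = offsets_start + i * off_size;
        let bytes = data.get(pos..pos + off_size)?;
        Some(bytes.iter().fold(0usize, |acc, &b| (acc << 8) | b as usize))
    };
    // Offsets are relative to the byte just before the object data,
    // so an offset of 1 points at the first data byte.
    let data_base = offsets_start + (count + 1) * off_size - 1;
    let mut entries = Vec::with_capacity(count);
    for i in 0..count {
        let begin = data_base + read_offset(i)?;
        let end = data_base + read_offset(i + 1)?;
        entries.push(data.get(begin..end)?);
    }
    let next = data_base + read_offset(count)?;
    Some((entries, next))
}

/// Returns the font names stored in the Name INDEX.
fn font_names(data: &[u8]) -> Option<Vec<String>> {
    let start = header_size(data)?;
    let (names, _next) = parse_index(data, start)?;
    Some(names.iter().map(|n| String::from_utf8_lossy(n).into_owned()).collect())
}
```

This only gets as far as the font name. The glyph names needed for text extraction live further in (Top DICT, String INDEX and charset), so a real fix would need considerably more than this.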
Thanks
Hello, nice library. It is very useful and I had no issues until I found a weird PDF. I don't know if it's an edge case or something common, because I'm not a PDF expert.
Using pdffonts this info is shown:
If you open the PDF with a reader, you can see the text properly rendered. With some readers even copy & paste works, but others just copy strange characters.
As far as I know (from reading and searching around), the custom encoding implies non-standard glyphs, and this is why some readers just copy trash: they are indeed copying the bytes, but those bytes mean nothing readable outside the PDF context. But this is just my guess.
I think this is the same case: https://community.adobe.com/t5/acrobat-discussions/strange-font-encoding-in-pdf-files/td-p/12472215 It looks like it mainly affects old files (mine is 20+ years old). Other people are hitting the same issue: https://github.com/kermitt2/grobid/issues/518
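To make that guess a bit more concrete, here is a small sketch of how a custom encoding could lead to unreadable copied text. Everything in it is an assumption for illustration: the codes, the glyph names (including the made-up `G03`) and the helper names `glyph_name_to_char` and `decode` are hypothetical and not taken from the attached file or from pdf-extract. The idea is that each byte in a text string is looked up in the font's encoding to get a glyph name, and the glyph name is then mapped to Unicode. If the glyph name is non-standard and no other mapping is available, there is nothing sensible to output.

```rust
use std::collections::HashMap;

/// Maps a glyph name to a Unicode character using a tiny hand-picked
/// subset of Adobe Glyph List style names. Real code would need the full list.
fn glyph_name_to_char(name: &str) -> Option<char> {
    match name {
        "A" => Some('A'),
        "B" => Some('B'),
        "space" => Some(' '),
        "fi" => Some('\u{FB01}'),
        _ => None, // non-standard names such as "G03" cannot be resolved
    }
}

/// Decodes raw string bytes through a custom encoding (code -> glyph name).
/// Codes without a usable mapping become U+FFFD, standing in for the
/// "trash" a reader would copy.
fn decode(bytes: &[u8], encoding: &HashMap<u8, &str>) -> String {
    bytes
        .iter()
        .map(|code| {
            encoding
                .get(code)
                .and_then(|name| glyph_name_to_char(name))
                .unwrap_or('\u{FFFD}')
        })
        .collect()
}
```

With an encoding that maps 1 to `A` and 2 to `B`, the bytes `[1, 2]` decode to readable text, while a code mapped to `G03` (or not mapped at all) comes out as the replacement character. A reader that still copies such text correctly presumably gets the mapping from somewhere else, for example from the embedded font itself.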
When using pdf-extract, this is the output I get:
And a bunch of lines like this. The text returned is just bytes in some encoding that are not readable.
This is a sample:
I cannot provide you the whole document but I'm attaching the first page so you can reproduce the error. page.pdf
I will investigate more and if I find anything useful I will put it here.
Thank you