metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

Errors out with this pdf #20

Closed vprelovac closed 3 years ago

vprelovac commented 3 years ago

http://www.qwantz.com/fanart/superman.pdf

pdfminer.pdffont.PDFUnicodeNotDefined: (None, 28)

metebalci commented 3 years ago

This error occurs because the supplied PDF contains a character with no Unicode mapping (in this case, at least the fi ligature glyph in the title). It can be worked around with the --replace-missing-char argument (the problem is visible in verbose mode). For example:

$ pdftitle --replace-missing-char ' ' -p superman.pdf 
AUni edtheoryofSuperman’sPowers
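
To illustrate the idea behind replacing a missing character, here is a minimal sketch in plain Python. It is not pdftitle's or pdfminer's code: `UndefinedGlyph`, `to_unicode`, `decode`, and the `cid_map` table are hypothetical names and data made up for the example, with character id 28 standing in for the unmapped glyph from the traceback.

```python
class UndefinedGlyph(Exception):
    """Hypothetical error: a character id has no Unicode mapping."""


def to_unicode(cid_map, cid):
    # Look up the character id in the (made-up) font mapping table.
    try:
        return cid_map[cid]
    except KeyError:
        raise UndefinedGlyph(cid)


def decode(cids, cid_map, replace_missing_char=None):
    # Decode a sequence of character ids. If a replacement is given,
    # substitute it for unmapped ids; otherwise let the error propagate.
    out = []
    for cid in cids:
        try:
            out.append(to_unicode(cid_map, cid))
        except UndefinedGlyph:
            if replace_missing_char is None:
                raise
            out.append(replace_missing_char)
    return "".join(out)


# Assumed toy mapping: id 28 (the ligature) is deliberately absent.
cid_map = {1: "U", 2: "n", 3: "i", 4: "e", 5: "d"}
cids = [1, 2, 3, 28, 4, 5]

print(decode(cids, cid_map, replace_missing_char=" "))  # Uni ed
```

Without a replacement character the sketch raises, which mirrors the reported failure; with one, the ligature position is filled in and decoding continues.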

However, the spacing is still wrong: the words are not separated by regular space characters; instead, each word is in an individual block. Unfortunately, I don't plan to implement anything to solve this at the moment. I am marking this as an enhancement but closing the issue.
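
For anyone who wants to experiment with the spacing problem, one possible approach is to insert a space wherever the horizontal gap between two consecutive blocks exceeds a threshold. The sketch below is only an assumption about how this could be done, not something pdftitle implements: `join_blocks`, the `(x0, x1, text)` tuple layout, the threshold value, and the coordinates are all hypothetical.

```python
def join_blocks(blocks, gap_threshold=1.0):
    """Join text blocks from one line, adding a space at large gaps.

    blocks: list of (x0, x1, text) tuples, where x0 and x1 are the
    left and right edges of the block (hypothetical layout).
    """
    ordered = sorted(blocks, key=lambda b: b[0])  # left to right
    parts = []
    prev_x1 = None
    for x0, x1, text in ordered:
        # A gap wider than the threshold is treated as a word break.
        if prev_x1 is not None and x0 - prev_x1 > gap_threshold:
            parts.append(" ")
        parts.append(text)
        prev_x1 = x1
    return "".join(parts)


# Made-up coordinates for three word blocks on the same line.
blocks = [(0, 8, "A"), (12, 50, "Unified"), (54, 90, "theory")]
print(join_blocks(blocks))  # A Unified theory
```

A fixed threshold is the simplest choice; in a real PDF it would probably need to scale with the font size, and blocks on different lines would have to be grouped first.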