metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

Exception thrown #32

Closed voidexpr closed 1 year ago

voidexpr commented 2 years ago

Here is a PDF where the extraction fails: 10.1.1.160.2604.pdf

Traceback (most recent call last):
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 974, in to_unichr
    return self.cid2unicode[cid]
KeyError: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 404, in draw_cid
    unichar = ts.Tf.to_unichr(cid)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 976, in to_unichr
    raise PDFUnicodeNotDefined(None, cid)
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 701, in run
    title = get_title_from_file(args.pdf)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 581, in get_title_from_file
    return get_title_from_io(raw_file)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 462, in get_title_from_io
    interpreter.process_page(page)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 292, in do_Tj
    self.do_TJ([s])
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 324, in do_TJ
    self.device.process_string(self.mpts, seq)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 378, in process_string
    self.draw_cid(ts, cid)
  File "/Users/someone/.pyenv/versions/3.7.3/lib/python3.7/site-packages/pdftitle.py", line 410, in draw_cid
    "exist in the font") from unicode_not_defined
Exception: PDF contains a unicode char that does not exist in the font

metebalci commented 1 year ago

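This error means the PDF contains a character that does not exist in the font. It can be skipped with the --replace-missing-char argument, which takes the string to use in place of the missing character (a space in the example below). For example: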

$ pdftitle -p 10.1.1.160.2604.pdf --replace-missing-char ' '
GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems
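
For anyone curious how the traceback and the flag fit together, here is a minimal, self-contained sketch of the pattern. It is not pdftitle's or pdfminer's actual code: `UnicodeNotDefined`, `to_unichr`, `decode_cids`, and the `replace_missing_char` parameter are hypothetical stand-ins chosen to mirror the names in the traceback. It assumes the font exposes a plain dict from character IDs to Unicode strings, and that a missing entry is either re-raised as a chained exception or swapped for a replacement string.

```python
from typing import Dict, Iterable, Optional


class UnicodeNotDefined(Exception):
    """Hypothetical stand-in for a 'no Unicode mapping for this CID' error."""


def to_unichr(cid2unicode: Dict[int, str], cid: int) -> str:
    """Look up one character ID; a missing key becomes UnicodeNotDefined."""
    try:
        return cid2unicode[cid]
    except KeyError:
        # Raising inside the except block chains implicitly, which prints
        # "During handling of the above exception, another exception occurred".
        raise UnicodeNotDefined(cid)


def decode_cids(
    cids: Iterable[int],
    cid2unicode: Dict[int, str],
    replace_missing_char: Optional[str] = None,
) -> str:
    """Decode character IDs, optionally substituting unmapped ones."""
    chars = []
    for cid in cids:
        try:
            chars.append(to_unichr(cid2unicode, cid))
        except UnicodeNotDefined as unicode_not_defined:
            if replace_missing_char is None:
                # "raise ... from ..." chains explicitly, which prints
                # "The above exception was the direct cause of ...".
                raise RuntimeError(
                    "PDF contains a unicode char that does not exist in the font"
                ) from unicode_not_defined
            chars.append(replace_missing_char)
    return "".join(chars)


if __name__ == "__main__":
    toy_font = {1: "G", 3: "M"}  # CID 2 is deliberately unmapped
    print(decode_cids([1, 2, 3], toy_font, replace_missing_char=" "))  # prints "G M"
```

Without a replacement string the sketch fails the same way the report above does; with one, the unmapped character is replaced and decoding continues.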