metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

Failed to identify title from JMF #18

Closed: eliotlencelot closed this issue 3 years ago

eliotlencelot commented 3 years ago

Hello metebalci, I am not able to use pdftitle -p PDF to extract the title of scientific articles from the Journal of Medicinal Food.

For example, this file does not produce a title: woo2019.pdf

Is it possible to adjust the algorithm for this kind of article?

I have tried the new option pdftitle -a max2 -p PDF without success. I do not see a list of values that can be passed to -a in the README, so as far as I can tell from this GitHub repository, the only options are -a max2 and -a default. If there are others, please note that I have not tried them.

Thank you!

eliotlencelot commented 3 years ago

I also get a Python error raised by pdfminer when adding the verbose -v option:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdffont.py", line 593, in to_unichr
    return self.cid2unicode[cid]
KeyError: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 589, in run
    title = get_title_from_file(args.pdf)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 523, in get_title_from_file
    return get_title_from_io(raw_file)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 444, in get_title_from_io
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdfinterp.py", line 933, in execute
    func(*args)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 291, in do_Tj
    self.do_TJ([s])
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 323, in do_TJ
    self.device.process_string(self.mpts, seq)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 374, in process_string
    self.draw_cid(ts, cid)
  File "/usr/local/lib/python3.7/dist-packages/pdftitle.py", line 394, in draw_cid
    unichar = ts.Tf.to_unichr(cid)
  File "/usr/local/lib/python3.7/dist-packages/pdfminer/pdffont.py", line 595, in to_unichr
    raise PDFUnicodeNotDefined(None, cid)
pdfminer.pdffont.PDFUnicodeNotDefined: (None, 2)

metebalci commented 3 years ago

Hello and thanks for raising this issue.

First, the error you mentioned in the second comment occurs because there is a character in the PDF that does not exist in the font. To work around this, you can use the --replace-missing-char option (e.g. pass ' ' to replace missing characters with a space). Previously I was silently ignoring these exceptions in normal (non-verbose) mode; I have changed this behavior in the new version 0.9.
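The effect of replacing a missing character can be illustrated with a minimal, self-contained sketch. This is not pdftitle's or pdfminer's actual code: `UnicodeNotDefined`, `ToyFont`, and `decode_cids` are hypothetical stand-ins that only mimic the `to_unichr` lookup seen in the traceback, where a CID with no Unicode mapping raises an exception.

```python
class UnicodeNotDefined(Exception):
    """Hypothetical stand-in for pdfminer's PDFUnicodeNotDefined."""


class ToyFont:
    """Hypothetical font holding a partial CID-to-Unicode table."""

    def __init__(self, cid2unicode):
        self.cid2unicode = cid2unicode

    def to_unichr(self, cid):
        # Mirrors the failing lookup in the traceback: an unmapped CID
        # produces a KeyError, which is re-raised as a font-level error.
        try:
            return self.cid2unicode[cid]
        except KeyError:
            raise UnicodeNotDefined(None, cid)


def decode_cids(font, cids, replace_missing_char=None):
    """Decode CIDs to text.

    If replace_missing_char is given, it is substituted for every CID
    that has no Unicode mapping; otherwise the error propagates.
    """
    chars = []
    for cid in cids:
        try:
            chars.append(font.to_unichr(cid))
        except UnicodeNotDefined:
            if replace_missing_char is None:
                raise
            chars.append(replace_missing_char)
    return "".join(chars)
```

With a toy font that maps only CIDs 0 and 1, `decode_cids(font, [0, 1, 2], " ")` substitutes a space for CID 2, while omitting the replacement character lets the exception surface, as in the verbose run above.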

I checked the PDF you linked, and the problem is that the first letter (A) of the first paragraph, not the title, uses the largest font on the page. So I implemented another algorithm, more general than the original, called eliot, in version 0.9. With this algorithm, you can select which font size to use, not by its absolute value but by its rank in size order (e.g. 0 is the largest, 1 is the second largest).

Now the result is:

$ pdftitle -a eliot --eliot-tfs 1 -p woo2019.pdf --replace-missing-char ' '

Lactobacillus HY2782 and Bifidobacterium HY8002 Decrease Airway Hyperresponsiveness Induced by Chronic PM2.5 Inhalation in Mice
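The idea of selecting text by font-size rank can be sketched as follows. This is only an illustration under stated assumptions, not the actual eliot implementation: the function name `title_by_font_size_rank`, the `(font_size, text)` input format, and the font sizes in the usage note are all hypothetical, and sizes are compared by exact equality.

```python
from collections import defaultdict


def title_by_font_size_rank(chunks, rank=0):
    """Return the text set in the rank-th largest distinct font size.

    chunks: iterable of (font_size, text) pairs in reading order
            (a hypothetical input format, assumed for this sketch).
    rank:   0 selects the largest font size, 1 the second largest, etc.
    Returns None when the page has fewer distinct sizes than rank + 1.
    """
    by_size = defaultdict(list)
    for size, text in chunks:
        by_size[size].append(text)

    # Distinct font sizes, largest first.
    sizes = sorted(by_size, reverse=True)
    if not 0 <= rank < len(sizes):
        return None
    return " ".join(by_size[sizes[rank]])
```

For a page resembling the one described above, with made-up sizes such as `[(48.0, "A"), (18.0, "Lactobacillus HY2782 and"), (18.0, "Bifidobacterium HY8002"), (9.0, "body text")]`, rank 0 picks up only the drop cap "A", while rank 1 yields the title text, which is why the second-largest size is selected in the command above.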