metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

hanging first letter causes problem #35

Closed abhinavdayal closed 1 year ago

abhinavdayal commented 1 year ago

https://www.sciencedirect.com/science/article/pii/S0735109719386929?via%3Dihub

In this PDF, the first paragraph begins with a very large single character, so the extracted title is 'T'. One possibility is to check for a single-letter result and then fall back to the next font size.
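A minimal sketch of the fallback suggested above, not code from pdftitle. It assumes the text has already been extracted as `(font_size, text)` pairs; the `Chunk` type, the function name, and the sample font sizes are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: one (font_size, text) pair per extracted text run.
Chunk = Tuple[float, str]


def title_skipping_single_letters(chunks: List[Chunk]) -> str:
    """Return the text at the largest font size that is longer than one character."""
    # Group the text runs by font size, preserving document order within a size.
    by_size: Dict[float, List[str]] = {}
    for size, text in chunks:
        by_size.setdefault(size, []).append(text)

    # Walk the font sizes from largest to smallest and skip single-letter candidates.
    for size in sorted(by_size, reverse=True):
        candidate = " ".join(by_size[size]).strip()
        if len(candidate) > 1:
            return candidate
    return ""


# Illustrative data only: the font sizes are made up.
chunks = [
    (24.0, "Salt Reduction to Prevent Hypertension"),
    (24.0, "and Cardiovascular Disease"),
    (48.0, "T"),
    (10.0, "body text"),
]
print(title_skipping_single_letters(chunks))
```

With this sample input, the lone 'T' at the largest size is skipped and the two runs at the next size are joined into the title.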

metebalci commented 1 year ago

The algorithm called eliot, together with its --eliot-tfs option, can be used for this. It is specific to this kind of PDF, but if you have such PDFs, it might be useful. For example:

$ pdftitle -p he2020.pdf -a eliot --eliot-tfs 1
Salt Reduction to Prevent Hypertension and Cardiovascular Disease

--eliot-tfs 1 means: do not use the font with the maximum size (which is the T here), but use the font with the second-largest size, which is the title here.
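The selection rule described above can be sketched roughly as follows. This is only an illustration of picking the n-th largest font size, not pdftitle's actual eliot implementation; the function name, the `(font_size, text)` input format, and the sample font sizes are assumptions.

```python
from typing import List, Tuple


def title_at_size_rank(chunks: List[Tuple[float, str]], rank: int) -> str:
    """Join the text runs whose font size is the rank-th largest (0 = largest)."""
    # Distinct font sizes, largest first.
    sizes = sorted({size for size, _ in chunks}, reverse=True)
    if rank < 0 or rank >= len(sizes):
        return ""
    target = sizes[rank]
    # Keep document order for runs that share the selected font size.
    return " ".join(text for size, text in chunks if size == target)


# Illustrative data only: the font sizes are made up.
sample = [
    (24.0, "Salt Reduction to Prevent Hypertension"),
    (24.0, "and Cardiovascular Disease"),
    (48.0, "T"),
    (10.0, "body text"),
]
print(title_at_size_rank(sample, 0))  # the hanging letter
print(title_at_size_rank(sample, 1))  # the title
```

In this sketch, rank 0 corresponds to the default behavior that returns the hanging letter, and rank 1 corresponds to the effect described for --eliot-tfs 1.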