metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

Failed to identify title #15

Closed BellLongworth closed 3 years ago

BellLongworth commented 3 years ago

Attached is a PDF that fails, from a large set of similar files. The returned title is a single character, "S". :(

E1-14.pdf

metebalci commented 3 years ago

Thanks for the example. If I am not mistaken, the first letters of words in the title have a larger font size than the remaining letters in each word. For example, the S in SETTLEMENT has a larger font size than the rest of the word (13 vs 10.4). Because of this, the S alone is taken as the title.
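To illustrate the failure, here is a minimal sketch of a largest-font heuristic. It does not reproduce pdftitle's actual code: the function name `title_by_largest_font` and the `(text, font_size)` run representation are hypothetical, and the sample runs are made up to mimic the 13 vs 10.4 pattern described above.

```python
def title_by_largest_font(runs):
    """Return the first contiguous stretch of text set in the largest font.

    `runs` is a list of (text, font_size) pairs in reading order.
    This is an illustrative simplification, not pdftitle's implementation.
    """
    if not runs:
        return ""
    largest = max(size for _, size in runs)
    parts = []
    for text, size in runs:
        if size == largest:
            parts.append(text)
        elif parts:
            # The stretch of largest-font text has ended.
            break
    return "".join(parts)


# Hypothetical runs: initial letters at 13, the rest of each word at 10.4.
sample_runs = [
    ("S", 13.0),
    ("ETTLEMENT ", 10.4),
    ("R", 13.0),
    ("EMAINS", 10.4),
]
result = title_by_largest_font(sample_runs)
print(result)
```

With these runs the heuristic stops right after the first large letter, which matches the single "S" reported in this issue.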

This is definitely experimental, but I released v0.8, which adds an algorithm (-a) option. With -a max2, I implemented an algorithm to work around this problem: it finds the block with the largest font, then also adds the block with a smaller font size, on the assumption that this covers titles like this one. It also keeps adding blocks until a block with a different font size is seen (this differs from the original algorithm). The resulting title has some strange letter casing (Settlement RemainS fRom the BRonze and iRon ageS at HoRBat menoRim (el-manaRa), loweR galilee); I believe this is caused by the font, but I am not sure. To overcome this, I also added a title case (-t) option, which fixes the irregular casing in the title.
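The idea behind max2 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code shipped in v0.8: the names `title_max2` and `to_title_case` are hypothetical, "a smaller font size" is assumed to mean the second-largest size in the document, and Python's `str.title()` stands in for whatever the -t option actually does.

```python
def title_max2(runs):
    """Collect text from the first largest-font run onward, accepting the
    two largest font sizes, and stop at the first run with any other size.

    `runs` is a list of (text, font_size) pairs in reading order.
    Assumption: the "smaller" size is the second-largest size present.
    """
    sizes = sorted({size for _, size in runs}, reverse=True)
    if not sizes:
        return ""
    allowed = set(sizes[:2])  # largest and second-largest font sizes
    # Start at the first run set in the largest font.
    start = next(i for i, (_, size) in enumerate(runs) if size == sizes[0])
    parts = []
    for text, size in runs[start:]:
        if size not in allowed:
            break  # a different font size ends the title
        parts.append(text)
    return "".join(parts)


def to_title_case(title):
    """Normalize irregular casing; str.title() is a stand-in for -t."""
    return title.title()


# Hypothetical runs: a small header, a mixed-size title, then body text.
sample_runs = [
    ("Journal header", 8.0),
    ("S", 13.0),
    ("ETTLEMENT ", 10.4),
    ("R", 13.0),
    ("EMAINS", 10.4),
    ("Body text starts here.", 9.0),
]
raw_title = title_max2(sample_runs)
clean_title = to_title_case(raw_title)
print(raw_title)
print(clean_title)
```

Accepting two sizes keeps the large initial letters and the smaller remaining letters together, while stopping at the first other size keeps body text out of the title.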

To sum up, please update to v0.8, and try like this: pdftitle -p paran2010.pdf -a max2 -t