metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0
1.05k stars, 115 forks

Title detection heuristics #53

Closed dufferzafar closed 3 years ago

dufferzafar commented 3 years ago

Currently, the PDF title is read directly from the document metadata, but most PDFs (arXiv papers, for example) don't actually have that metadata. We could add custom logic for the case where we detect an arXiv PDF, which is what #52 is about, or we could add heuristic-based "guessing" of the title (say, from the text with the largest font on the first page). This will obviously not work everywhere. But it doesn't have to!
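To make the "largest font on the first page" idea concrete, here is a minimal sketch. It assumes the first page's text has already been extracted as runs of text with a font size (for example with a PDF layout library); the `Span` type, the `guess_title` name, and the `min_chars` threshold are all hypothetical and not part of pdfx.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Span:
    """One run of text from the first page, with its font size in points.

    Hypothetical input type: how spans are extracted is left to a PDF library.
    """
    text: str
    size: float


def guess_title(spans: Iterable[Span], min_chars: int = 4) -> Optional[str]:
    """Guess the title as the text set in the largest font on the page."""
    # Ignore empty or whitespace-only runs.
    candidates = [s for s in spans if s.text.strip()]
    if not candidates:
        return None

    # Try font sizes from largest to smallest. Sizes are rounded so that
    # tiny floating-point differences do not split one line into two groups.
    sizes = sorted({round(s.size, 1) for s in candidates}, reverse=True)
    for size in sizes:
        parts = [s.text for s in candidates if round(s.size, 1) == size]
        # Join the runs in reading order and collapse repeated whitespace.
        title = " ".join(" ".join(parts).split())
        # Skip very short matches such as a decorative drop cap.
        if len(title) >= min_chars:
            return title
    return None
```

Falling back to the next-largest size when the match is very short is one way to accept that the heuristic will not work everywhere while still avoiding the most obvious misfires.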

I have past experience with KDE's KFileMetaData, which used a similar heuristic and gave good results. It was later removed though (commit), because KDE as a distro has to make a lot of people happy.

If you're okay with a heuristic-based approach, I could take a stab at implementing this!

Use case: I would really like to have a script that auto-renames my PDFs with proper titles. I actually had a script based on KFileMetaData, but I've since moved to Windows. https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
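The renaming step of such a script could look roughly like the sketch below. It is not the linked script. The function names, the 120-character limit, and the "untitled" fallback are assumptions for illustration, and the character filter targets names that are invalid on Windows.

```python
import re
from pathlib import Path


def title_to_filename(title: str, max_len: int = 120) -> str:
    """Turn a detected title into a file name that is safe on Windows."""
    # Replace characters Windows rejects in file names (and control chars).
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", title)
    # Collapse whitespace; Windows also dislikes trailing dots and spaces.
    cleaned = " ".join(cleaned.split()).strip(" .")
    cleaned = cleaned[:max_len].rstrip(" .")
    return (cleaned or "untitled") + ".pdf"


def rename_pdf(path: Path, title: str) -> Path:
    """Rename `path` after `title`, refusing to overwrite another file."""
    target = path.with_name(title_to_filename(title))
    if target == path:
        return path
    if target.exists():
        raise FileExistsError(target)
    return path.rename(target)
```

Raising on a name collision rather than overwriting is a deliberately conservative choice, since a heuristic title can easily be the same for two different files.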

dufferzafar commented 3 years ago

I just looked at the code, and the reference detection logic is also heuristic in nature:

https://github.com/metachris/pdfx/blob/9e6864c5f9bcc8801e12c63a64d6efdfd1960494/pdfx/extractor.py#L14-L22
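For illustration, pattern-based reference detection of this kind might look like the sketch below. The regular expressions and the `find_references` helper are hypothetical and simplified; they are not copied from the linked pdfx code.

```python
import re

# Hypothetical patterns, deliberately loose: they will miss some identifiers
# and may match some strings that are not real references.
ARXIV_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")


def find_references(text: str) -> dict:
    """Return arXiv IDs and DOIs that the patterns find in `text`."""
    arxiv_ids = set(ARXIV_RE.findall(text))
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    dois = {d.rstrip(".,;)") for d in DOI_RE.findall(text)}
    return {"arxiv": sorted(arxiv_ids), "doi": sorted(dois)}
```

Like the title guess, this trades completeness for simplicity: it is good enough for most documents without claiming to be exact.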

dufferzafar commented 3 years ago

There's a Haskell-based tool built around renaming PDFs: https://github.com/2mol/pboy but it has no title heuristics, just metadata.

dufferzafar commented 3 years ago

I actually found a library that implements the heuristics that I talked about: https://github.com/metebalci/pdftitle

It works pretty well. So I've modified my original script to use this instead of KDE's metadump.

Closing this, because I don't think it makes any sense to re-implement this functionality in pdfx.