Closed dufferzafar closed 3 years ago
I just looked at the code, and the reference detection logic is also heuristic in nature:
There's a Haskell-based tool built around renaming PDFs: https://github.com/2mol/pboy but it has no title heuristics, just metadata.
I actually found a library that implements the heuristics that I talked about: https://github.com/metebalci/pdftitle
It works pretty well. So I've modified my original script to use this instead of KDE's metadump.
Closing this, because I don't think it makes any sense to re-implement this functionality in pdfx.
Currently, the PDF title is retrieved directly from the metadata info, but most PDFs (like those from arXiv) don't actually have that metadata. We could add custom logic for when we detect an arXiv PDF, which is what #52 is about, or else we could add heuristic-based "guessing" of the title (say, from the text with the largest font on the first page). This will obviously not work everywhere. But it doesn't have to!
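To make the idea concrete, here is a minimal sketch of the "largest font on the first page" guess with a metadata-first fallback. It assumes the first page's text has already been extracted into (text, font size) pairs by some PDF library; that extraction step is not shown. `Span`, `guess_title`, `pick_title`, and the `tolerance` parameter are hypothetical names for illustration, not part of pdfx.

```python
from typing import NamedTuple, Optional, Sequence


class Span(NamedTuple):
    """A run of first-page text with its font size (hypothetical type)."""
    text: str
    font_size: float


def guess_title(spans: Sequence[Span], tolerance: float = 0.1) -> Optional[str]:
    """Join, in reading order, the non-empty spans set in the largest font."""
    candidates = [s for s in spans if s.text.strip()]
    if not candidates:
        return None
    largest = max(s.font_size for s in candidates)
    # Titles often wrap over several lines, so keep every span whose size
    # is within `tolerance` of the largest one.
    parts = [s.text.strip() for s in candidates
             if largest - s.font_size <= tolerance]
    return " ".join(parts)


def pick_title(metadata_title: Optional[str],
               spans: Sequence[Span]) -> Optional[str]:
    """Prefer the metadata title; fall back to the heuristic if it is empty."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    return guess_title(spans)
```

This will get fooled by things like a large journal banner or a drop cap, which is the "will not work everywhere" part; keeping the metadata title as the first choice means the guess only kicks in when there is nothing better.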
I have past experience with KDE's KFileMetaData, which used a similar heuristic and used to give good results. It was later removed, though (commit), because KDE as a distro has to make a lot of people happy.
If you're okay with a heuristic-based approach, I could take a stab at implementing this!
Use case: I would really like to have a script that auto-renames my PDFs with proper titles. I actually had a script that was based on KFileMetaData, but I've since moved on to Windows. https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
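For the renaming half of that use case, something along these lines could work. This is only a sketch, not the linked script: `title_to_filename` and `rename_pdf` are hypothetical helpers, and the title is assumed to come from whatever extraction step is in use. It replaces the characters Windows rejects in file names and never overwrites an existing file.

```python
import re
from pathlib import Path


def title_to_filename(title: str, max_length: int = 120) -> str:
    """Turn a title into a file name that is also valid on Windows."""
    # Replace characters Windows forbids in file names, plus control chars.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", title)
    # Collapse whitespace runs and drop leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    stem = cleaned[:max_length].rstrip(" .") or "untitled"
    return stem + ".pdf"


def rename_pdf(path: Path, title: str) -> Path:
    """Rename `path` after `title`; leave it alone if the target exists."""
    target = path.with_name(title_to_filename(title))
    if target != path and not target.exists():
        path.rename(target)
        return target
    return path
```

Reserved Windows device names such as `CON` or `NUL` are not handled here; a real script would want to check for those too.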