openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Add utility to get PDF info for proper titles on PDF entries #168

Open benoit74 opened 2 weeks ago

benoit74 commented 2 weeks ago

The content of PDF documents is not indexed for suggestions, even though in some ZIMs the PDFs are the "core" content.

Having a utility in scraperlib to extract PDF info and get the document title would probably help.
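
To make the idea concrete, here is a rough stdlib-only sketch of what "get the document title" could look like. It is not an existing scraperlib API: the name `naive_pdf_title` is hypothetical, and the byte scan is a deliberately naive stand-in for real parsing. It only finds a `/Title` entry stored as an uncompressed, unencrypted string, and it can also pick up outline (bookmark) titles, since those use the same key. A real utility would most likely delegate to a PDF library (for instance pypdf exposes `PdfReader(...).metadata.title`).

```python
import re
from typing import Optional

# "/Title (literal string)"; backslash-escaped characters are allowed inside,
# unescaped nested parentheses are NOT supported by this sketch.
_LITERAL_TITLE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# "/Title <hex string>"
_HEX_TITLE = re.compile(rb"/Title\s*<([0-9A-Fa-f\s]*)>")
# Only \( \) and \\ are unescaped; other escapes (\n, octal, ...) are left as-is.
_SIMPLE_ESCAPE = re.compile(rb"\\([()\\])")


def _decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if it starts with a BOM, else Latin-1.

    Latin-1 is only an approximation of PDFDocEncoding.
    """
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")


def naive_pdf_title(data: bytes) -> Optional[str]:
    """Return the first /Title string found in raw PDF bytes, or None.

    Hypothetical helper: assumes the entry is neither compressed nor encrypted.
    """
    match = _LITERAL_TITLE.search(data)
    if match:
        raw = _SIMPLE_ESCAPE.sub(lambda m: m.group(1), match.group(1))
    else:
        match = _HEX_TITLE.search(data)
        if not match:
            return None
        hex_digits = b"".join(match.group(1).split())
        if len(hex_digits) % 2:
            # An odd number of hex digits implies a trailing 0.
            hex_digits += b"0"
        raw = bytes.fromhex(hex_digits.decode("ascii"))
    title = _decode_pdf_text(raw).strip()
    return title or None
```

A scraper could call this on the PDF payload before adding the entry, and fall back to the file name when it returns `None` (missing or empty title).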

See https://github.com/openzim/warc2zim/issues/290 for one use-case.