openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Add utility to index PDF documents content #167

Open benoit74 opened 2 weeks ago

benoit74 commented 2 weeks ago

The content of PDF documents is not indexed for full-text search, even though in some ZIMs these documents are the "core" of the ZIM.

Extracting the text content of PDFs would benefit many scrapers, so it should ideally be exposed in scraperlib.

See e.g. https://github.com/openzim/warc2zim/issues/289
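
To make the request more concrete, here is a minimal sketch of what such a utility could look like. It assumes the third-party `pypdf` package is used for extraction, which is only one possible choice. The names `normalize_text` and `extract_pdf_text` are hypothetical and are not part of scraperlib today. How the returned text is then handed to the ZIM indexer is deliberately left out.

```python
import io
import re


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace so the indexer receives compact text."""
    return re.sub(r"\s+", " ", text).strip()


def extract_pdf_text(data: bytes) -> tuple[str, str]:
    """Return (title, content) for a PDF given as raw bytes.

    Hypothetical helper, not an existing scraperlib function.
    """
    # Assumption: pypdf is the extraction backend. It is imported lazily so
    # this module still loads when pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    metadata = reader.metadata  # may be None when the PDF has no info dict
    title = (metadata.title if metadata else None) or ""
    # extract_text() can return an empty string for image-only pages.
    content = " ".join(page.extract_text() or "" for page in reader.pages)
    return normalize_text(title), normalize_text(content)
```

A scraper could call `extract_pdf_text(pdf_bytes)` when adding a PDF entry and pass the result as the indexing title and content. Scanned PDFs without a text layer would yield empty content with this approach and would need OCR, which is out of scope for this sketch.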