PDF content is not indexed in full text search

openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

https://pypi.org/project/warc2zim/

GNU General Public License v3.0

Closed by benoit74 1 month ago

benoit74 commented 4 months ago

The content of PDF documents is not indexed for full-text search, even though in some ZIMs the PDFs are the "core" of the content.

For instance, in fas-military-medicine_en (https://dev.library.kiwix.org/viewer#fas-military-medicine_en_2024-05, or https://dev.library.kiwix.org/#lang=&q=military+medicine), there is only one main page plus PDF documents, so full-text search is not usable.

rgaudin commented 4 months ago

Extracting PDF info would be beneficial to many scrapers and should thus ideally be exposed in scraperlib.
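
As a rough illustration of what such a shared helper might involve (a sketch only, not scraperlib's actual API): the function names below are hypothetical, and the extraction step assumes the third-party pypdf package, which this thread does not name. The idea is to pull the text out of each page, clean it up, and produce a single string that could be handed to the indexer.

```python
import re
from typing import Iterable, List


def normalize_text(raw: str) -> str:
    """Clean extracted PDF text before indexing (hypothetical helper).

    Heuristic: re-join words split by a hyphen at a line break, then
    collapse all whitespace runs. This also merges genuine compounds
    that happen to break at their hyphen.
    """
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def build_index_content(page_texts: Iterable[str]) -> str:
    """Join the non-empty, normalized page texts into one indexable string."""
    pages = (normalize_text(text) for text in page_texts)
    return "\n".join(page for page in pages if page)


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the text of each page of a PDF.

    Assumption: pypdf is installed; it is imported lazily so the helpers
    above stay usable without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A caller would run `build_index_content(extract_page_texts(path))` for each PDF entry. How the resulting string would be attached to the ZIM entry for indexing is left open here, since the thread does not specify it. Scanned PDFs without a text layer would yield empty content and need OCR, which this sketch does not attempt.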

benoit74 commented 3 months ago

I just created the scraperlib issue to implement this. It will not make it into 3.4.0, and I am not sure when it will be planned.