openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0
Add PDFs to suggestions list #290

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Content of PDF documents is not indexed for suggestions, even though in some ZIMs the PDFs are the "core" of the ZIM.

For instance, in fas-military-medicine_en (https://dev.library.kiwix.org/viewer#fas-military-medicine_en_2024-05, or https://dev.library.kiwix.org/#lang=&q=military+medicine), there is only one main page plus PDF documents, so suggestions are not usable.

Not sure how to tackle this need, but it is clearly a bit sad not to have PDFs in the suggestion list for such ZIMs. This is probably not wanted for all ZIMs, so maybe we should add a CLI option?

rgaudin commented 1 month ago

Suggestions are based on the ZIM entry's title, so that's an easier task than full-text indexing.

It's easy to read PDF metadata via a third-party lib, so if a Title is set, we could use that and default to the filename otherwise.
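
A minimal sketch of that "Title, else filename" fallback, to make the idea concrete. It assumes pypdf as the third-party lib (the thread does not name one), and the function names `pick_suggestion_title` and `read_pdf_title` are hypothetical, not part of warc2zim or scraperlib.

```python
from pathlib import Path
from typing import Optional, Union
from urllib.parse import unquote


def pick_suggestion_title(entry_path: str, metadata_title: Optional[str]) -> str:
    """Return the PDF metadata Title if it is usable, else the file name.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    # Fall back to the last path segment, with percent-encoding decoded.
    name = unquote(entry_path.rsplit("/", 1)[-1])
    return name or entry_path


def read_pdf_title(pdf_file: Union[str, Path]) -> Optional[str]:
    """Read the Title field from the PDF document information, if any.

    Assumption: pypdf is the metadata library; it is imported lazily so
    the fallback logic above stays usable without it installed.
    """
    from pypdf import PdfReader

    info = PdfReader(pdf_file).metadata  # None when the PDF has no metadata
    return info.title if info else None
```

Usage would look like `pick_suggestion_title(path, read_pdf_title(local_file))`. Keeping the fallback separate from the PDF parsing means the title choice can be checked without any PDF at hand.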

benoit74 commented 1 month ago

Does it mean you consider that all PDFs should be added to suggestions? (I'm still not sure on my side, but I can't find an example where I would not want a PDF added to the suggestions if we have a proper title.)

rgaudin commented 1 month ago

I have no strong opinion

benoit74 commented 2 weeks ago

Just created a scraperlib issue, since this should be implemented there. Not going to make it for 3.4.0.

benoit74 commented 2 weeks ago

And I think we should add the new CLI argument: better to include it now than to be blocked on some ZIM creation by whatever problem this might cause.
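
For illustration only, such a toggle could be sketched with argparse as below. The flag name `--pdf-suggestions`, its default, and the `build_parser` helper are all assumptions; the thread settles neither the argument's name nor its polarity, and this is not warc2zim's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a standalone demo parser (hypothetical, not warc2zim's own)."""
    parser = argparse.ArgumentParser(prog="demo")
    # Hypothetical flag: BooleanOptionalAction (Python 3.9+) also generates
    # a matching --no-pdf-suggestions switch. The default is an assumption.
    parser.add_argument(
        "--pdf-suggestions",
        action=argparse.BooleanOptionalAction,
        default=False,
        help="Add PDF entries to the suggestions list",
    )
    return parser


args = build_parser().parse_args(["--pdf-suggestions"])
print(args.pdf_suggestions)
```

An on/off pair like this would let a recipe disable the behavior explicitly if it causes trouble for a given ZIM, whichever default is eventually chosen.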