openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Add option to exclude some paths from front pages #378

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Currently, whether a ZIM item is marked is_front is based purely on the item mimetype: https://github.com/openzim/warc2zim/blob/5de5d0e0a284611ac376a328fd18b7ad7a9ad5aa/src/warc2zim/items.py#L58-L62

This has the drawback that we sometimes end up with unwanted front pages. A typical case is iframes, which are meant only to be embedded within a page.

I think this could easily be solved with an additional CLI parameter containing an is_front_exclude regex, matched against ZIM paths, for items that must not be marked is_front. I don't think having an is_front_include is necessary.
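
To make the proposal above concrete, here is a minimal sketch of how such an exclusion could work. It is not the warc2zim implementation: the `--is-front-exclude` flag name, the `is_front` and `compile_exclude` helpers, and the assumption that only `text/html` items currently qualify as front pages are all hypothetical and used for illustration only.

```python
import argparse
import re
from typing import Optional, Pattern

# Assumption for illustration: only HTML items qualify as front pages today.
FRONT_MIMETYPES = {"text/html"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the hypothetical exclusion flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--is-front-exclude",  # hypothetical flag name, not an existing option
        default=None,
        help="Regex on ZIM paths that must not be marked as front pages",
    )
    return parser


def compile_exclude(value: Optional[str]) -> Optional[Pattern[str]]:
    """Compile the regex once; None means no path is excluded."""
    return re.compile(value) if value else None


def is_front(path: str, mimetype: str, exclude: Optional[Pattern[str]]) -> bool:
    """Return True if the item should be marked as a front page."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_mimetype = mimetype.split(";", 1)[0].strip().lower()
    if base_mimetype not in FRONT_MIMETYPES:
        return False
    # An item whose path matches the exclusion regex is never a front page.
    if exclude is not None and exclude.search(path):
        return False
    return True
```

With a pattern such as `^example\.com/embed/`, an HTML item at `example.com/embed/player.html` would no longer be marked as a front page, while `example.com/index.html` still would. Using `search` rather than `fullmatch` leaves it to the user to anchor the pattern, which is a design choice open to discussion.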

rgaudin commented 1 month ago

Didn't we already have a similar issue where we discussed getting this in-iframe information from the crawler?

benoit74 commented 1 month ago

Good point, we might even already have the information in the WARC. I don't remember exactly when or where we discussed this. Just using this information would probably cover at least 80% of the need here, and in an automated way, which is far superior. To be investigated.