openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Automatically ignore ZIM resources found on a website to crawl #396

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

If the crawler encounters a ZIM file among a website's resources, it should block it immediately so that the file ends up neither in the WARC nor in the resulting ZIM.

This is probably a page block rule to be implemented in Browsertrix Crawler.

I don't think we need a switch to disable the blocking; I don't see a scenario where it would make sense to ZIM a ZIM inside a ZIM ^^
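To illustrate the kind of check such a block rule could rely on, here is a minimal Python sketch. It is not the Browsertrix Crawler implementation. The function names and the URL pattern are hypothetical, and the assumption is that a ZIM resource can be recognized either by a `.zim` path suffix (optionally followed by a two-letter split suffix such as `.zimaa`) or by the four-byte magic number at the start of the file.

```python
import re
from urllib.parse import urlsplit

# A ZIM file starts with the magic number 72173914 (0x044D495A),
# stored little-endian, i.e. the bytes "ZIM\x04".
ZIM_MAGIC = b"ZIM\x04"

# Hypothetical pattern (an assumption, not taken from the crawler):
# matches paths ending in ".zim" or in a split-file suffix like ".zimaa".
ZIM_PATH_RE = re.compile(r"\.zim(?:[a-z]{2})?$", re.IGNORECASE)


def looks_like_zim_url(url: str) -> bool:
    """Return True if the URL path ends with a ZIM-like extension."""
    # Only the path is inspected, so query strings and fragments are ignored.
    return bool(ZIM_PATH_RE.search(urlsplit(url).path))


def looks_like_zim_payload(first_bytes: bytes) -> bool:
    """Return True if the response body starts with the ZIM magic number."""
    return first_bytes[:4] == ZIM_MAGIC
```

The URL check is cheap and can run before any request is made, while the magic-number check would catch ZIM files served from URLs without a telling extension. Which of the two (if either) fits the crawler's block-rule mechanism is left open here.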

benoit74 commented 1 month ago

@Popolechien this is what we've discussed so far, for your records.