openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Intelligent crawling by resource types to respect maximum file size #349

Open fluidicice opened 1 month ago

fluidicice commented 1 month ago

Given the size limits on ZIM files created with Zimit, and the seemingly random order in which web pages are downloaded, could an advanced option be added to exclude certain types of files? E.g. .mp4, .avi, .jpg

I have found that on some larger websites, videos are downloaded before Zimit has grabbed all of the HTML pages, so the limit(s) are reached leaving a lot of broken links while including some unnecessary videos.

Alternatively, a tickbox to grab the HTML files first would be helpful, followed by pictures and then videos if there is still space remaining.

Abel-Trans commented 1 month ago

It would be beneficial to know how the tool works in detail. I hope the README can be made more detailed.

kelson42 commented 1 month ago

@Abel-Trans We really want to improve the documentation. If you have questions, please open issues, one issue per question. Based on these issues, we will update the documentation.

rgaudin commented 1 month ago

There are really two requests in one here. Excluding by file type is easy with the existing --exclude option using path extensions. Excluding by detected MIME type could be an addition.

The second request about the order of fetching resources is interesting. I think we need a URL and the limit details, because in my understanding resources are found in a page at parsing time, just before the page is added to the WARC. So having a page's resources inside the WARC while the page referencing them is missing seems unlikely. The opposite is more likely though: the HTML is included but not all of its resources, because the limit has been reached. This could be an option. Is that what you were describing?

fluidicice commented 1 month ago

@rgaudin Thanks for your reply, that's almost correct: videos linked from already-downloaded HTML pages were included before all the other HTML pages had been downloaded, leaving some fully functional pages with videos alongside a lot of dead links to other pages on the website.

The aim is to start with the most compact form of information, text (in this case HTML files), and from there fill the remaining data limit with less dense information: pictures next, then videos last if there is still space, if that makes sense.

May I have an example of an --exclude file type please? I've read the manual and couldn't work it out.

benoit74 commented 1 month ago

The --exclude parameter only excludes pages, not resources inside a given page. So, for instance, if a page embeds an mp4 player, the mp4 will still be fetched. This is probably why you haven't been able to make it work.
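For what it's worth, a rough sketch of what a page-level exclude could look like. The URL, --name value, and regex below are illustrative assumptions, not a tested crawl; --exclude values are regular expressions matched against page URLs:

```shell
# Hypothetical zimit invocation: skip crawling pages whose URL ends in a
# media extension (remember: this skips whole pages, not embedded resources).
# zimit --url https://example.com/ --name example \
#       --exclude '\.(mp4|avi|jpe?g)([?#].*)?$'

# The same regex, demonstrated with grep -E against sample URLs;
# it prints the two media URLs such a rule would match (and thus exclude):
pattern='\.(mp4|avi|jpe?g)([?#].*)?$'
printf '%s\n' \
  'https://example.com/media/clip.mp4' \
  'https://example.com/photos/cat.jpg' \
  'https://example.com/articles/page.html' \
| grep -E "$pattern"
```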

It is possible in Browsertrix Crawler to exclude page resources with the --blockRules parameter (see https://crawler.docs.browsertrix.com/user-guide/crawl-scope/#scope-rule-examples), but this is not yet available in zimit. Note that blocked resources are then completely ignored.
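For reference, a block rule in a Browsertrix Crawler config is roughly shaped like the fragment below; the exact field names and schema are an assumption here and should be checked against the linked documentation:

```yaml
# Rough sketch of a Browsertrix Crawler config fragment (unverified schema;
# verify field names against the crawler docs before use):
blockRules:
  - url: "\\.(mp4|avi)([?#].*)?$"
    type: block
```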

If I understand you correctly, what you would like is a way to best allocate a given archive size, so that we ensure we have at least all HTML/JS/CSS, then all images if possible, and then all videos if possible. This is an extremely novel and complex feature; I'm pretty sure it is not something we will be able to work out in the coming months without significant funding, as it is far more than scraper maintenance alone. Not an easy one, but still meaningful.

I propose to keep this issue focused on this main concern; I've moved the request that zimit support --blockRules to https://github.com/openzim/zimit/issues/353