openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Architecture Update: Split crawling part of zimit into separate project (browsertrix-core), maintained by Webrecorder #45

Closed: ikreymer closed this issue 4 years ago

ikreymer commented 4 years ago

The crawling infrastructure is now generic enough that it will be of use to Webrecorder as part of the next-generation Browsertrix Core setup, which runs in a single container. The component can move to the Webrecorder org and have its own Docker image.

This repo will simply inherit the base Docker image and add zimit.py and warc2zim, while the crawling will be maintained by Webrecorder and will be extended to support other use cases, of course making sure that the zimit use case still works.

It may make sense to add a simple integration test (perhaps of isago.ml?) to ensure that things are working before updating the zimit image. The plan is as follows:

@rgaudin this is sort of what we discussed yesterday, let me know if you have any thoughts/concerns on this.

rgaudin commented 4 years ago

Looks good; you seem very cautious. What kind of difficulty are you expecting? Do you plan on refactoring the crawling part in that browsertrix-core image?

ikreymer commented 4 years ago

Yes, I anticipate expanding the system and adding more options, for example, a YAML-config file, supporting multiple seeds, perhaps supporting some of the features in the current YAML files: https://github.com/webrecorder/browsertrix/tree/master/sample-crawls

This shouldn't remove any options; I just thought adding an integration test might be useful to ensure current behavior is maintained. I think it may even be possible to run it on GitHub Actions now, which supports Docker.
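
For illustration only, an integration test along these lines could be sketched in Python roughly as follows. This is a minimal sketch, not the test that was actually added: the image name `example/zimit:1.0.0`, the `/output` mount point, and the `zimit --url ... --name ... --output ...` arguments are all assumptions and would need to be checked against the real image before use.

```python
import subprocess
from pathlib import Path


def build_crawl_command(image, url, name, output_dir):
    """Assemble a `docker run` command line for a single small crawl.

    The arguments after the image name are ASSUMED for illustration;
    verify them against the image's actual entry point and options.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{output_dir}:/output",  # assumed mount point inside the container
        image,
        "zimit", "--url", url, "--name", name, "--output", "/output",
    ]


def find_zim_files(output_dir):
    """Return the ZIM files present in output_dir, sorted by path."""
    return sorted(Path(output_dir).glob("*.zim"))


def run_integration_test(image, url, name, output_dir):
    """Run the crawl and fail if the container errors or no ZIM file appears."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_crawl_command(image, url, name, output_dir), check=True)
    zim_files = find_zim_files(output_dir)
    if not zim_files:
        raise AssertionError(f"no ZIM file was produced in {output_dir}")
    return zim_files
```

A CI job (GitHub Actions or otherwise) could then call something like `run_integration_test("example/zimit:1.0.0", "https://isago.ml/", "isago", "/tmp/zim-out")` before a new image is published. Only the "did it produce a ZIM file" check is shown here; checking the file's contents would be a separate step.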

kelson42 commented 4 years ago

@ikreymer What is important is that the Docker image is properly versioned so we can always pick a specific version.
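
As a rough illustration of what "properly versioned" could mean on the consuming side, here is a small Python sketch that checks whether an image reference is pinned to a specific tag or digest rather than a floating one. The helper name `is_pinned` and the example image names are hypothetical, and treating only `latest` as a floating tag is a simplifying assumption.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if image_ref names a specific tag or digest.

    Hypothetical helper: only a missing tag and `latest` are
    treated as unpinned here, which is a simplification.
    """
    # A digest (name@sha256:...) identifies one exact image.
    if "@" in image_ref:
        return True
    # Inspect only the last path segment so that a registry port
    # (e.g. localhost:5000/name) is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    tag = last_segment.split(":", 1)[1]
    return tag not in ("", "latest")
```

For example, `is_pinned("example/browsertrix-core:0.1.0")` is true, while `is_pinned("example/browsertrix-core")` and `is_pinned("example/browsertrix-core:latest")` are not.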

rgaudin commented 4 years ago

Yes, in this case it might be wise to include a test with isago. 👍