Closed ikreymer closed 4 years ago
Looks good; you seem very cautious. What kind of difficulty are you expecting? Do you plan on refactoring the crawling part in that browsertrix-core image?
Yes, I anticipate expanding the system and adding more options, for example a YAML config file, support for multiple seeds, and perhaps some of the features in the current sample crawl YAML files: https://github.com/webrecorder/browsertrix/tree/master/sample-crawls
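For discussion, a multi-seed YAML config could look roughly like this; every key name below is purely illustrative, not an actual schema:

```yaml
# Hypothetical zimit crawl config sketch; key names are
# illustrative assumptions, not the real schema.
name: my-crawl
seeds:
  - https://example.com/
  - https://example.org/docs/
workers: 2
scope: prefix
```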
This shouldn't remove any options, but I thought adding an integration test might be useful to ensure the current behavior is maintained. I think it may even be possible to run it on GitHub Actions now, which supports Docker.
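A minimal GitHub Actions smoke test could look roughly like the sketch below; the workflow layout, image name, and command-line flags are assumptions, not the actual setup:

```yaml
# Hypothetical workflow sketch; image name and zimit flags are illustrative.
name: integration-test
on: [push, pull_request]

jobs:
  crawl-isago:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the zimit image
        run: docker build -t zimit .
      - name: Crawl a small known site and produce a ZIM
        run: docker run -v $PWD/output:/output zimit zimit --url https://isago.ml/ --name isago
      - name: Check that a ZIM file was produced
        run: test -n "$(ls output/*.zim)"
```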
@ikreymer What is important is that the Docker image is properly versioned so we can always pin a specific version.
Yes, in this case it might be wise to include a test with isago. 👍
The crawling infrastructure is now generic enough that it will be used by Webrecorder as part of the next-generation Browsertrix Core setup, which runs in a single container. The component can move to the Webrecorder org and have its own Docker image.
This repo will simply inherit the base Docker image and add `zimit.py` and warc2zim, while the crawling will be maintained by Webrecorder and extended to support other use cases, of course making sure that the zimit use case still works. It may make sense to add a simple integration test (perhaps of isago.ml?) to ensure that things are working before updating the zimit image. The plan is as follows:

- `webrecorder/browsertrix-core` docker image

@rgaudin this is sort of what we discussed yesterday, let me know if you have any thoughts/concerns on this.
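Under that plan, this repo's Dockerfile would reduce to something like the sketch below; the tag and paths are assumptions, and the main point is pinning an exact base-image version as discussed above:

```dockerfile
# Hypothetical Dockerfile sketch for this repo; tag and paths are illustrative.
# Pin an exact browsertrix-core version rather than :latest.
FROM webrecorder/browsertrix-core:0.1.0

# Add the zimit-specific layer on top of the generic crawler
RUN pip install warc2zim
COPY zimit.py /app/zimit.py

ENTRYPOINT ["python3", "/app/zimit.py"]
```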