osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

Add snapshot image after initialization #11

Open johanneskiesel opened 5 years ago

johanneskiesel commented 5 years ago

Currently, prepare commits the intermediate image after indexing, but not after initialization: https://github.com/osirrc2019/jig/blob/4e3765cd59b0869c354b2d7c6f9da826624e470e/run.py#L47

Also committing after initialization can save time, network traffic, and disk space: thanks to the layered file system, the downloaded files are then stored only once rather than once per image.

The tag could be something like "{}-initialized".format(args.tag)
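A minimal sketch of how that could look, assuming a container object that exposes `commit(repository=..., tag=...)` the way containers in the Docker SDK for Python do. The helper names `phase_tag` and `commit_phase` are hypothetical and not taken from run.py:

```python
def phase_tag(tag, phase):
    """Build a per-phase tag, e.g. 'latest' -> 'latest-initialized'."""
    return "{}-{}".format(tag, phase)


def commit_phase(container, repo, tag, phase):
    """Snapshot the container's current state as <repo>:<tag>-<phase>.

    `container` is assumed to expose commit(repository=..., tag=...);
    this function is a hypothetical helper, not part of the jig.
    """
    snapshot = phase_tag(tag, phase)
    container.commit(repository=repo, tag=snapshot)
    return "{}:{}".format(repo, snapshot)
```

Calling `commit_phase(container, repo, args.tag, "initialized")` right after the init step would then produce the extra snapshot, with the existing commit after indexing left unchanged.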

lintool commented 5 years ago

I'm 👎 on this but open to discussion.

ryan-clancy commented 5 years ago

If we did this, we would have two images. For example:

where the first image would be the base image for the second.

I think this would lead to some odd lifecycle management, where we'd need to update the base image of the second image to be the updated (after re-init) first image, if that's even possible. Another approach may be to start a container using the second image and re-run the init script, but this again can get complicated (init scripts would then have to be idempotent and clean up existing files before downloading new ones).

I'm 👎 on this too for now as it would add a lot of hidden complexity.

johanneskiesel commented 5 years ago

Maybe there is some confusion here: why would you want to re-init an image? I thought init is just about setup. So my question is: why would I want to run setup every time I index a collection, when I can just start from a snapshot taken after setup was completed?

But in case you do need to re-init an image (I can imagine this if you encountered an error or so), why can't you just create both latest-initialized and latest-indexed again? I see you would need an additional "--purge" parameter (or similar) to let people force an init even if there is already an initialized image.
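Roughly, the decision could be sketched like this. Both the `--purge` flag and the `should_init` helper are hypothetical, and the set of existing tags is assumed to come from wherever the jig would look up local images:

```python
import argparse


def build_parser():
    """Hypothetical subset of the prepare arguments."""
    parser = argparse.ArgumentParser(description="prepare sketch (hypothetical)")
    parser.add_argument("--tag", default="latest")
    # Hypothetical flag: force a fresh init even if a snapshot already exists.
    parser.add_argument("--purge", action="store_true")
    return parser


def should_init(existing_tags, tag, purge):
    """Run init if it is forced or if no '<tag>-initialized' snapshot exists."""
    return purge or "{}-initialized".format(tag) not in existing_tags
```

Without `--purge`, an existing latest-initialized image would be reused and only indexing would run; with it, both snapshots would be created again.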

lintool commented 5 years ago

I think the tradeoff is more complex lifecycle management... I think we're assuming that init/index will be done once and that's it.

I suppose with all the bells and whistles we can bind each subcommand to a hook and allow committing at each phase in a flexible manner? I'm inclined to punt on this for now though...

johanneskiesel commented 5 years ago

I see. I want to add that it is not my intention to press this issue (that might have gotten lost in the move from the original mail to this issue). I'm well aware that this can be added later on without a problem (it requires no change to the specification), so you can just wait and see whether indexing is done just once or more often.

lintool commented 5 years ago

No worries! Thanks for your contributions!