osirrc / jig

Jig for the Open-Source IR Replicability Challenge (OSIRRC)

How to avoid indexing after re-build docker? #110

Open matthew-z opened 5 years ago

matthew-z commented 5 years ago

It seems that jig performs indexing and commits the result to a new image. If my understanding is correct, after modifying the source code and building a new Docker image, we also have to re-index to create a new image. I wonder how to avoid this.

I think the most straightforward way is to keep the index in a directory on the host machine and mount it into the Docker container at launch. Then, even if the image is destroyed or outdated, we can still mount the index directory into a new container.
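To make the idea concrete, here is a minimal sketch of what launching a container with a host-side index could look like. This is not how jig currently works; the function name `build_run_command`, the mount point `/index`, and the image name are all made up for illustration. Only the `docker run --rm -v host:container` form is standard Docker CLI.

```python
import os
import shlex


def build_run_command(image, host_index_dir, container_index_dir="/index", args=()):
    """Return a `docker run` argv that bind-mounts a host index directory.

    All names here are hypothetical; jig does not expose this function.
    """
    # -v needs an absolute host path for a bind mount; a bare name
    # would be interpreted by Docker as a named volume instead.
    host_path = os.path.abspath(host_index_dir)
    return [
        "docker", "run", "--rm",
        "-v", f"{host_path}:{container_index_dir}",
        image,
        *args,
    ]


if __name__ == "__main__":
    # Hypothetical image name and command, for illustration only.
    cmd = build_run_command("my-org/my-image:latest", "./index", args=["sh", "/search"])
    print(shlex.join(cmd))
```

Because the index lives on the host, rebuilding the image would only replace the code, and the same directory could be mounted into the new container.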

albpurpura commented 5 years ago

Hey Matthew, I proposed this option at the beginning, when we started designing the jig. In the end, @lintool proposed saving the index in the image itself to reduce loading times. I implemented what you just proposed for the training jig instead, which saves the data to an external file and allows trained models to be shared between images.

In any case you could save the index from one image to your host machine, then load the index data again if you wanted to.
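A rough sketch of the "save the index to the host" step, using only standard Docker CLI commands (`docker create`, `docker cp`, `docker rm`). The index location `/index` inside the image and the container name `index-export` are assumptions; adjust them to wherever your image actually stores its index.

```python
import subprocess


def export_index_commands(image, host_dir, index_path="/index", container="index-export"):
    """Return the docker commands that copy index_path out of image into host_dir.

    index_path and container are assumed values, not jig conventions.
    """
    return [
        # Create (but do not start) a container from the indexed image.
        ["docker", "create", "--name", container, image],
        # Copy the index directory from the container to the host.
        ["docker", "cp", f"{container}:{index_path}", host_dir],
        # Remove the temporary container.
        ["docker", "rm", container],
    ]


def export_index(image, host_dir, **kwargs):
    """Run the export, removing the temporary container even if the copy fails."""
    create, copy, remove = export_index_commands(image, host_dir, **kwargs)
    subprocess.run(create, check=True)
    try:
        subprocess.run(copy, check=True)
    finally:
        subprocess.run(remove, check=True)
```

Loading the index back would be the reverse `docker cp` into a container created from the new image, or a bind mount of `host_dir`.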

arjenpdevries commented 5 years ago

I was thinking of doing something similar. A good way would be to add a flag that passes a directory to be mounted as a volume for data storage, just like the /input mount. Did you do that, or just hardcode it, @albpurpura?
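For illustration, such a flag could be wired up roughly like this. The flag name `--index-dir`, the `--image` option, and the `/index` mount point are hypothetical and not part of jig's current interface.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse a hypothetical launcher command line."""
    parser = argparse.ArgumentParser(description="Hypothetical hook launcher")
    parser.add_argument("--image", required=True, help="Docker image to run")
    parser.add_argument(
        "--index-dir",
        default=None,
        help="host directory to mount at /index (hypothetical flag)",
    )
    return parser.parse_args(argv)


def volume_args(index_dir, mount_point="/index"):
    """Translate the optional flag into `docker run` volume arguments."""
    if index_dir is None:
        # No flag given: keep the current behaviour, index stays in the image.
        return []
    return ["-v", f"{os.path.abspath(index_dir)}:{mount_point}"]


if __name__ == "__main__":
    args = parse_args()
    print(["docker", "run", "--rm", *volume_args(args.index_dir), args.image])
```

Making the flag optional keeps the existing commit-to-image workflow as the default.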

matthew-z commented 5 years ago

I see, we can use model_folder to mount any data into the container with the train hook. I think it would be great to add a similar argument to the other hooks for mounting data from the host machine.

albpurpura commented 5 years ago

@arjenpdevries I did it exactly as you said. The folder to mount is passed as an argument; have a look here: https://github.com/osirrc/jig/blob/master/trainer.py

lintool commented 5 years ago

> In the end @lintool proposed to save the index in the image within the docker to reduce the loading times.

Correct. This is a tradeoff between jig complexity (one more thing the jig needs to manage) vs. image efficiency (having to rebuild the index each time). At the start, we opted to simplify the jig since we were just getting started. However, now that things are working, I'm happy to revisit for v2.

cmacdonald commented 5 years ago

@matthew-z I had some scripts that update the scripts in an already existing image. See https://github.com/osirrc/terrier-docker/blob/master/dev/bumpContainer.sh
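The general approach can be sketched with standard Docker commands (`docker create`, `docker cp`, `docker commit`, `docker rm`). This is a generic illustration, not a transcription of the linked script; the container name `bump-tmp` and the example paths are assumptions.

```python
import subprocess


def bump_commands(image, host_script, container_path, container="bump-tmp"):
    """Return docker commands that replace one file in an image without re-indexing.

    The container name and paths are hypothetical examples.
    """
    return [
        # Create a stopped container from the existing (already indexed) image.
        ["docker", "create", "--name", container, image],
        # Overwrite the script inside the container with the updated host copy.
        ["docker", "cp", host_script, f"{container}:{container_path}"],
        # Commit the modified container back under the same image name.
        ["docker", "commit", container, image],
        # Clean up the temporary container.
        ["docker", "rm", container],
    ]


def bump(image, host_script, container_path, **kwargs):
    """Run the commands in order, stopping at the first failure."""
    for cmd in bump_commands(image, host_script, container_path, **kwargs):
        subprocess.run(cmd, check=True)
```

Since the index is already part of the image's filesystem, committing after the copy keeps it in place and only the script changes.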

matthew-z commented 5 years ago

@cmacdonald Great! Thank you!