gregtatum opened 1 month ago
I agree, but we should double-check that training continuation after preemption will still work this way. I think it needs a bunch of files, like the optimizer config that Marian writes to the model directory. Another consideration is that GCS storage cost is likely small compared to GPU cost, and there are ways to archive things there to make it even cheaper.
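For reference, roughly the set of files I'd expect we need to keep for continuation. This is just a sketch, and the file names are my assumption about Marian's usual outputs rather than something verified against our tasks:

```python
# Sketch only: the checkpoint files Marian typically writes alongside model.npz.
# If any of these are missing, resuming after preemption probably won't work.
from pathlib import Path

CONTINUATION_FILES = [
    "model.npz",                # latest weights
    "model.npz.yml",            # the training config Marian saved
    "model.npz.optimizer.npz",  # optimizer state (e.g. Adam moments)
    "model.npz.progress.yml",   # training progress (epochs/updates seen so far)
]

def can_resume(model_dir: str) -> bool:
    """True only if everything needed to continue training is present."""
    return all((Path(model_dir) / name).exists() for name in CONTINUATION_FILES)
```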
I don't think we've pulled exact numbers on storage costs, but my assumption is that cost scales with how much we store and how long we keep it cached.
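As a very rough back-of-envelope (the per-GB price is just my assumption of roughly what GCS standard storage charges per GB-month, and the sizes and counts below are made up for illustration):

```python
# Back-of-envelope only; all numbers here are assumptions, not quoted GCS pricing.
def monthly_storage_cost_usd(size_gb: float, price_per_gb_month: float = 0.02) -> float:
    return size_gb * price_per_gb_month

# e.g. ten ~800 MB intermediate checkpoints kept for each of 100 training runs:
checkpoints_gb = 10 * 0.8 * 100
print(f"~${monthly_storage_cost_usd(checkpoints_gb):.0f}/month")  # roughly $16/month
```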
It costs money to store models in the cloud. We could save a bit, and make the output of the train tasks less confusing, if we stored only a single final model. As far as I've seen, we never use any of the other models, unless I'm missing something.
We would have to make sure that training continuation is updated and that fetching old models for continued training still works.
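Roughly what I'm picturing for the artifact filter (hypothetical names, not actual pipeline code): keep the single final model plus whatever Marian needs to resume, and drop the per-metric "best" checkpoints that nothing downstream seems to consume.

```python
# Hypothetical sketch of an upload filter; the file-name patterns are my
# guess at Marian's outputs and would need to be checked against a real run.
import re

# The final model and its training state (config, optimizer, progress).
KEEP = re.compile(r"^model\.npz(\.yml|\.optimizer\.npz|\.progress\.yml)?$")

def should_upload(filename: str) -> bool:
    # Skips e.g. model.npz.best-chrf.npz and other per-metric checkpoints.
    return bool(KEEP.match(filename))
```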
What do you think @eu9ene?