The current training job checkpoints every N seconds and only produces a saved model at the very end. The idea is that any checkpoint can be converted to a saved model later.
To facilitate that, we should save the prediction graph out at the start of training, so that converting a checkpoint to a saved model doesn't depend on running any code to re-create the same graph with the same args.
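A rough sketch of what that could look like, assuming a TF1-style graph-mode setup (function names, tensor names, and paths below are illustrative, not the actual code in this repo): export the prediction graph as a MetaGraphDef once at training start, then an offline step can import it, restore any checkpoint into it, and write a SavedModel.

```python
# Sketch only: assumes TF1 graph mode via tf.compat.v1; all names are hypothetical.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


def export_prediction_graph(graph, path="prediction_graph.meta"):
    """At training start: serialize the prediction graph (including its
    Saver nodes) so conversion later needs no model-building code."""
    tf.train.export_meta_graph(filename=path, graph=graph)


def checkpoint_to_saved_model(meta_path, checkpoint_path, export_dir,
                              input_tensor_name="input:0",
                              output_tensor_name="output:0"):
    """Offline step: exported graph + checkpoint -> SavedModel."""
    with tf.Graph().as_default() as graph:
        # import_meta_graph rebuilds the graph and returns its Saver
        # (assumes the exported prediction graph contained one).
        saver = tf.train.import_meta_graph(meta_path)
        with tf.Session(graph=graph) as sess:
            saver.restore(sess, checkpoint_path)

            signature = tf.saved_model.signature_def_utils.predict_signature_def(
                inputs={"input": graph.get_tensor_by_name(input_tensor_name)},
                outputs={"output": graph.get_tensor_by_name(output_tensor_name)},
            )
            builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
            builder.add_meta_graph_and_variables(
                sess,
                [tf.saved_model.tag_constants.SERVING],
                signature_def_map={
                    tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature,
                },
            )
            builder.save()
```

The conversion step only needs the .meta file and a checkpoint, so it could run anywhere (e.g. as part of deployment) without importing the training code or knowing the original args.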
Secondly, once we have that, should the training process even produce a saved model? Or is producing a saved model better considered part of the deployment step of the workflow?