Hi Haeyoon, which solver are you using?
To replicate our paper's results, please use the https://github.com/parasj/checkmate/tree/mlsys20_artifact branch rather than master. The two have diverged significantly. I would recommend using the Gurobi solver as that is what we used for our experiments.
I am actually using the Gurobi solver. I will try the branch above for now, thanks!
In the `mlsys20_artifact` branch there is no actual test for e2e training (like the one in the tutorial of the master branch), so I tried to write training test code on my own. I am aiming to train with the maximum batch size to see whether that is actually possible. However, I found that `execution.py` does not support returning the gradient values that are needed for the training process:
```python
def tfgraph_from_schedule(model, g: DFGraph, scheduled_result: ScheduledResult,
                          loss=categorical_cross_entropy, debug: bool = False):
    def _eager_eval(input_val: tf.Tensor, label_val: tf.Tensor):
        # ...
        out_grads = None
        return our_loss, out_grads
    return _eager_eval
```
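For training I would expect something along these lines instead. This is only a rough sketch on my side, not the repo's code: `loss_fn` stands in for whatever callable runs the scheduled forward pass and returns the loss, and `trainable_vars` for the model's trainable variables.

```python
import tensorflow as tf

def _eager_eval_with_grads(input_val: tf.Tensor, label_val: tf.Tensor,
                           loss_fn, trainable_vars):
    # Sketch only: run the scheduled forward pass under a GradientTape so that
    # gradients w.r.t. the trainable variables come back together with the loss.
    with tf.GradientTape() as tape:
        our_loss = loss_fn(input_val, label_val)
    out_grads = tape.gradient(our_loss, trainable_vars)
    return our_loss, out_grads
```

With gradients returned like this, a training step would just be `optimizer.apply_gradients(zip(out_grads, trainable_vars))`.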
Also, `test_execution.py` (which says it is not valid anymore) only tests whether the loss value is calculated properly, omitting any check of the gradient values.
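The kind of check I was hoping to see is roughly the following. Again only a sketch: it assumes `_eager_eval` actually returns gradients (which it currently does not) and that both sides use the same loss reduction.

```python
import numpy as np
import tensorflow as tf

def check_against_keras(model, eager_eval, x, y):
    # Reference loss/gradients from the unmodified Keras model.
    with tf.GradientTape() as tape:
        ref_loss = tf.reduce_mean(
            tf.keras.losses.categorical_crossentropy(y, model(x)))
    ref_grads = tape.gradient(ref_loss, model.trainable_variables)

    # Loss/gradients from the Checkmate-scheduled execution (assumed to
    # return gradients, unlike the current _eager_eval).
    sched_loss, sched_grads = eager_eval(x, y)

    np.testing.assert_allclose(ref_loss.numpy(), np.asarray(sched_loss), rtol=1e-4)
    for rg, sg in zip(ref_grads, sched_grads):
        np.testing.assert_allclose(rg.numpy(), np.asarray(sg), rtol=1e-4)
```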
Maybe I have to stick to the master branch for the e2e training test and try to inspect any issues related to the solving process.
Hi, I am currently trying to reproduce the evaluation results shown in the Checkmate MLSys paper, but am facing many challenges throughout the process.
As the evaluations in the paper showed that Checkmate makes larger training batch sizes possible, I tried to compare the Keras model and its corresponding Checkmate application across various batch sizes (a rough sketch of the baseline side of that comparison is below).
However, it seems that Checkmate can't find a solution even for batch sizes that already fit in memory for the corresponding Keras model.
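For reference, the Keras baseline side of that comparison is essentially the following sketch; `build_model` is just a stand-in for whatever compiled Keras model is under test.

```python
import tensorflow as tf

def max_fittable_batch_size(build_model, x, y, start=256):
    # Sketch: keep doubling the batch size on the plain Keras model until
    # model.fit() runs out of GPU memory; the last successful size is the
    # baseline I compare Checkmate's feasible schedules against.
    batch_size, largest_ok = start, None
    while batch_size <= len(x):
        tf.keras.backend.clear_session()
        model = build_model()  # hypothetical: returns a compiled Keras model
        try:
            model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
            largest_ok = batch_size
            batch_size *= 2
        except tf.errors.ResourceExhaustedError:
            break
    return largest_ok
```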
I tried many models provided by `load_keras_model`, and it does not work even for the test model in the tutorial. With the tutorial's test model and a batch size of 2000, the Keras model trains fine via `model.fit()` without OOM, while Checkmate cannot find a feasible schedule. (I am using a budget of approximately 10 GB (TITAN Xp) and have checked that the budget value is assigned properly as 10976400000.)
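For completeness, the budget value appears to correspond to 90% of the 12196 MB the card reports, expressed in bytes:

```python
# Budget derivation (my understanding): 90% of the reported 12196 MB on the
# TITAN Xp, converted to bytes.
GPU_MEM_MB = 12196
budget_bytes = GPU_MEM_MB * 900_000  # 12196 * 0.9 * 1e6 = 10976400000
```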
Could there be an issue with the solver, or is it not fair to compare Checkmate with the original Keras model this way? I am trying to figure it out myself too, but it would be a big help if you could take a look as well :)