parasj / checkmate

Training neural networks in TensorFlow 2.0 with 5x less memory
Apache License 2.0

Checkmate solver fails while the original model can be run with a specific memory budget #151

Closed hy00nc closed 4 years ago

hy00nc commented 4 years ago

Hi, I am currently trying to reproduce the evaluation results from the Checkmate MLSys paper, but I am facing a number of challenges along the way.

Since the evaluations in the paper show that Checkmate makes larger training batch sizes possible, I tried to compare Keras models against their corresponding Checkmate versions across a range of batch sizes.
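The sweep itself can be sketched as a simple search over batch sizes. This is a generic sketch of my own; `fits()` is a hypothetical stand-in for "training succeeds at batch size b" (e.g. `model.fit()` without OOM, or `compile_tf2` finding a schedule) and is not a checkmate API:

```python
def max_batch_size(fits, lo=1, hi=1 << 16):
    """Binary-search for the largest batch size b in [lo, hi] with fits(b) True.

    Assumes fits is monotone: if a batch size fits, every smaller one does too.
    Returns 0 if even the smallest batch size does not fit.
    """
    if not fits(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if fits(mid):
            lo = mid  # mid fits; the answer is at least mid
        else:
            hi = mid - 1  # mid does not fit; the answer is below mid
    return lo
```

Running the Keras model and the Checkmate-compiled model through the same search makes the two maximum batch sizes directly comparable.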

However, it seems that Checkmate cannot find a solution even for batch sizes that are small enough to fit in memory for the corresponding Keras model.

I tried many of the models provided by load_keras_model, and even the test model from the tutorial does not work. With a batch size of 2000, the tutorial's Keras test model trains fine with model.fit() and no OOM, while Checkmate cannot find a feasible schedule:

ERROR:root:[checkmate] Checkmate solver could find no feasible schedule for the specificed budget of 10976400000.0
Traceback (most recent call last):
  File "tutorial.py", line 33, in <module>
    label_spec=element_spec[1]
  File "/home/haeyoon/checkmate/checkmate/tf2/wrapper.py", line 110, in compile_tf2
    raise ValueError("No feasible solution for specified budget of {}".format(budget))
ValueError: No feasible solution for specified budget of 10976400000.0

(I am using a budget of approximately 10 GB (TITAN Xp), and I checked that the budget value was assigned properly as 10976400000.)

Could there be an issue with the solver, or is it unfair to compare against the original Keras model this way? I am trying to figure it out myself too, but it would be a big help if you could take a look as well :)
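For reference, here is how the budget value can be reconstructed. This is my own sketch with two assumptions: 12196 MiB is what nvidia-smi reports as total memory for a TITAN Xp, and 0.9 is a safety factor to leave headroom for the framework; with those assumptions the arithmetic reproduces the logged number exactly:

```python
# Hypothetical reconstruction of the budget passed to compile_tf2.
# Assumptions: nvidia-smi reports 12196 MiB total for a TITAN Xp, a 0.9
# safety factor is applied, and "MB" is treated as 10**6 bytes.
GPU_MEM_MB = 12196       # total device memory as reported by nvidia-smi
SAFETY_FACTOR = 0.9      # leave ~10% headroom for the framework itself

budget_bytes = int(GPU_MEM_MB * SAFETY_FACTOR * 10**6)
print(budget_bytes)  # 10976400000, the value in the error message
```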

parasj commented 4 years ago

Hi Haeyoon, which solver are you using?

To replicate our paper's results, please use the https://github.com/parasj/checkmate/tree/mlsys20_artifact branch rather than master. The two have diverged significantly. I would recommend using the Gurobi solver as that is what we used for our experiments.

hy00nc commented 4 years ago

I am actually already using the Gurobi solver. I will try the branch above for now, thanks!

hy00nc commented 4 years ago

In the mlsys20_artifact branch there is no test for e2e training (like the tutorial on the master branch), so I tried to write my own training test code. I am aiming to train with the maximum batch size to see if it is actually possible. However, I found that execution.py does not support returning the gradient values needed for training:

def tfgraph_from_schedule(model, g: DFGraph, scheduled_result: ScheduledResult,
                          loss=categorical_cross_entropy, debug: bool = False):
    def _eager_eval(input_val: tf.Tensor, label_val: tf.Tensor):
        # ...
        out_grads = None  # gradients are never computed or returned
        return our_loss, out_grads
    return _eager_eval
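What I would have expected is something along these lines. This is my own sketch using a plain tf.GradientTape rather than checkmate's rematerialization schedule (the recomputed forward pass would have to run inside the tape); it only illustrates the (loss, grads) return shape that a training loop needs:

```python
import tensorflow as tf

def make_eager_eval(model, loss_fn=tf.keras.losses.categorical_crossentropy):
    # Hypothetical gradient-returning variant of _eager_eval. It ignores
    # checkmate's recomputation logic entirely and just shows the shape of
    # a (loss, gradients) result usable with optimizer.apply_gradients.
    def _eager_eval(input_val, label_val):
        with tf.GradientTape() as tape:
            logits = model(input_val, training=True)
            loss = tf.reduce_mean(loss_fn(label_val, logits))
        out_grads = tape.gradient(loss, model.trainable_variables)
        return loss, out_grads
    return _eager_eval
```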

Also, test_execution.py (which says it is no longer valid) only tests whether the loss value is calculated properly, omitting any checking of the gradient values.

For now, maybe I have to stick with the master branch for the e2e training test and keep inspecting issues in the solving process.