Resource exhausted error in the middle of training

sameerg07 commented 5 years ago

I am using a 1070 graphics card with 8GB of memory, and I still get the OOM (resource exhausted) error partway through training on MNIST with the default configuration. I even reduced the batch size from 32 to 16 to 8. Can you help me with this?

titu1994 commented 5 years ago

I've been trying to get it to work as well, but I'm coming up short. There are two issues causing this.

TF needs everything to be the same dtype during computation of the adaptive step of dopri5, so everything is cast to float64. This basically makes your entire model double the size it normally is.
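To make the float64 cost concrete, here is a rough back-of-the-envelope sketch. The tensor shape is a made-up example (batch 32, 64 channels, 28x28 feature map), not the actual model in this repository; the only fact relied on is that float32 uses 4 bytes per element and float64 uses 8.

```python
# Bytes per element for the two floating-point widths.
BYTES_PER_ELEMENT = {"float32": 4, "float64": 8}


def tensor_mebibytes(num_elements, dtype):
    """Memory needed to hold `num_elements` values of `dtype`, in MiB."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)


# Hypothetical activation tensor: batch 32, 64 channels, 28x28 feature map.
# These sizes are assumptions for illustration only.
num_elements = 32 * 64 * 28 * 28

mib32 = tensor_mebibytes(num_elements, "float32")
mib64 = tensor_mebibytes(num_elements, "float64")
print(f"float32: {mib32:.3f} MiB, float64: {mib64:.3f} MiB, "
      f"ratio: {mib64 / mib32:.1f}x")
```

The same factor of two applies to every weight, activation, and gradient tensor that gets cast, which is why the whole model ends up at roughly twice its usual footprint.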

On top of that, I haven't been successful in implementing the adjoint method, which computes the backward augmented ODE in a fixed number of function calls and therefore a fixed amount of memory. Without it, the number of function calls explodes and eventually causes an OOM.
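For reference, the adjoint formulation from the Neural ODE paper can be summarized as follows. The notation is an assumption chosen here for illustration: $z(t)$ is the hidden state, $f$ the learned dynamics with parameters $\theta$, $L$ the loss, and $a(t)$ the adjoint state.

$$
\frac{dz(t)}{dt} = f(z(t), t, \theta), \qquad a(t) = \frac{\partial L}{\partial z(t)}
$$

$$
\frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial f(z(t), t, \theta)}{\partial z}
$$

$$
\frac{dL}{d\theta} = -\int_{t_1}^{t_0} a(t)^{\top} \frac{\partial f(z(t), t, \theta)}{\partial \theta} \, dt
$$

These are integrated backward from $t_1$ to $t_0$ together with $z(t)$, so the intermediate states of the forward solve do not have to be kept in memory. Backpropagating directly through the solver, by contrast, stores the activations of every function evaluation.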

At best, try it with a large tolerance of 0.1 or 0.2. Normally this will lead to worse results, though, and I doubt it would work even then.
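The trade-off between tolerance and the number of function calls can be seen in a toy adaptive solver. The sketch below is not tfdiffeq's dopri5; it is a much simpler embedded Euler/Heun pair written in plain Python, with a made-up test problem (dy/dt = -5y) and an absolute tolerance only. It just counts how often the dynamics function is called for a loose and a tight tolerance.

```python
import math


def adaptive_heun(f, y0, t0, t1, tol):
    """Integrate dy/dt = f(t, y) from t0 to t1 with an embedded Euler/Heun pair.

    Returns (y at t1, number of calls to f). Toy solver for illustration.
    """
    t, y = t0, y0
    h = (t1 - t0) / 10.0  # initial step size guess
    calls = 0
    while t < t1:
        last = h >= t1 - t
        if last:
            h = t1 - t  # do not step past the end of the interval
        k1 = f(t, y)
        k2 = f(t + h, y + h * k1)
        calls += 2
        y_low = y + h * k1                 # first-order (Euler) estimate
        y_high = y + 0.5 * h * (k1 + k2)   # second-order (Heun) estimate
        err = abs(y_high - y_low)          # local error estimate
        if err <= tol:                     # accept the step
            t = t1 if last else t + h
            y = y_high
        # Grow or shrink the step based on the error estimate.
        h *= min(2.0, max(0.2, 0.9 * math.sqrt(tol / (err + 1e-16))))
    return y, calls


def decay(t, y):
    return -5.0 * y  # hypothetical test dynamics


exact = math.exp(-10.0)
y_loose, calls_loose = adaptive_heun(decay, 1.0, 0.0, 2.0, tol=0.1)
y_tight, calls_tight = adaptive_heun(decay, 1.0, 0.0, 2.0, tol=1e-6)
print(f"tol=0.1 : {calls_loose} calls, abs error {abs(y_loose - exact):.2e}")
print(f"tol=1e-6: {calls_tight} calls, abs error {abs(y_tight - exact):.2e}")
```

The loose tolerance needs far fewer function calls but gives a noticeably less accurate answer, which matches the expectation that a tolerance of 0.1 or 0.2 saves memory at the cost of worse results.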

I am currently waiting on TF 2.0 to figure out how I can implement the adjoint method. There's no direct equivalent to PyTorch's Function class for me to port it properly.

sameerg07 commented 5 years ago

I am using adam not dopri

titu1994 commented 5 years ago

Both of the adaptive-step methods are costly. Adams takes more time than dopri5 most of the time.

sameerg07 commented 5 years ago

I wanted to classify faces95 using ODENet, but it would be greatly helpful if I could use the MNIST code.

titu1994 commented 5 years ago

Try the PyTorch version. It's currently more efficient and has the adjoint method available.

sameerg07 commented 5 years ago

okay!

titu1994 commented 5 years ago

An update on this issue: simply using the Euler method (or any fixed-step solver for that matter: euler, heun, midpoint, rk4) seems to properly train a reduced model without issue. I finished training one to completion (160 epochs) and have posted the checkpoint along with the log file.
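Why fixed-step solvers behave better for memory can be illustrated with a small plain-Python sketch. This is not tfdiffeq code; the step functions, the `CountingDynamics` helper, and the test problem (dy/dt = -y) are all hypothetical. The point is that the number of function calls is fixed in advance by the number of steps, regardless of how the dynamics behave.

```python
import math


def euler_step(f, t, y, h):
    """One explicit Euler step: 1 function call."""
    return y + h * f(t, y)


def rk4_step(f, t, y, h):
    """One classical Runge-Kutta 4 step: 4 function calls."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate_fixed(step, f, y0, t0, t1, num_steps):
    """Integrate from t0 to t1 using `num_steps` equal steps."""
    h = (t1 - t0) / num_steps
    y = y0
    for i in range(num_steps):
        y = step(f, t0 + i * h, y, h)
    return y


class CountingDynamics:
    """dy/dt = -y, with a counter for the number of evaluations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, t, y):
        self.calls += 1
        return -y


exact = math.exp(-1.0)

f_euler = CountingDynamics()
y_euler = integrate_fixed(euler_step, f_euler, 1.0, 0.0, 1.0, num_steps=100)

f_rk4 = CountingDynamics()
y_rk4 = integrate_fixed(rk4_step, f_rk4, 1.0, 0.0, 1.0, num_steps=100)

print(f"euler: {f_euler.calls} calls, abs error {abs(y_euler - exact):.2e}")
print(f"rk4  : {f_rk4.calls} calls, abs error {abs(y_rk4 - exact):.2e}")
```

Because the call count is known up front (steps times stages), the memory needed to backpropagate through the solver is bounded and predictable, unlike an adaptive solver whose step count depends on the tolerance and on the dynamics being learned.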

MNIST is an extremely easy task for CNNs, and the model quickly reaches 99.37% accuracy even with just 62k parameters and a single ODEBlock.

I'm considering writing a script for CIFAR 10, which would be more indicative of the actual performance of ODENets.

Also, there's a new paper from Berkeley AI Research, "ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs", which compares against the standard ODENet from the Neural ODE paper on CIFAR 10 and 100 and finds that RK45 (the Dopri5 solver) diverges on both tasks for standard ODENets. So even with the adjoint method proposed by the Neural ODE paper, more difficult tasks cannot be solved using Dopri5.

Hopefully, the authors of the BAIR paper will eventually release the code upon acceptance, at which point I hope to directly implement the checkpointed variant with the Discretize-Then-Optimize method (but that might take up to a year, maybe 6 months if we're lucky).

titu1994 / tfdiffeq
