tensorflow / models

Models and examples built with TensorFlow

Memory collapse when running the code in models/research/learned_optimizer #7937

Closed djycn closed 4 years ago

djycn commented 4 years ago

When running the code in the models/research/learned_optimizer folder, we encounter an out-of-memory error while trying to train the HierarchicalRNN optimizer on the ConvNet problem (defined by default in problems_generator.py) with MNIST as the dataset. The first few training iterations are fine, but after a few steps memory usage grows dramatically and the process runs out of memory. Our GPU is an RTX 2080 Ti with 11 GB of graphics memory, and the machine has 32 GB of RAM.

This memory growth traces back to the while_loop operation at line 354 of trainable_optimizer.py, whose memory usage increases with the number of iterations. Moreover, if swap_memory in the while_loop is set to True, it consumes all of the CPU memory instead. However, since TensorFlow builds a static graph, memory usage should not keep increasing after the graph is built and finalized.
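For reference, here is a minimal sketch (not the actual trainable_optimizer.py code, just a generic unrolled inner loop) of how swap_memory is passed to tf.while_loop and why memory scales with the number of unrolled steps: every iteration's intermediates must be kept for the later backward pass, and swap_memory=True merely moves them to host RAM.

```python
import tensorflow as tf  # TF 1.x style, matching the project

num_unroll_steps = 20
initial_params = tf.random_normal([1000])  # stand-in for the inner problem's parameters

def cond(i, p):
    return i < num_unroll_steps

def body(i, p):
    # One inner optimization step; TensorFlow stashes each iteration's
    # intermediates so the backward pass can later run through them.
    inner_loss = tf.reduce_sum(tf.square(p))
    grad = tf.gradients(inner_loss, p)[0]
    return i + 1, p - 0.01 * grad

_, final_params = tf.while_loop(
    cond, body, [tf.constant(0), initial_params],
    swap_memory=True)  # True offloads the stashed tensors to host (CPU) memory

# Backpropagating through the whole unroll is what requires keeping
# per-iteration tensors around in the first place.
meta_loss = tf.reduce_sum(tf.square(final_params))
meta_grads = tf.gradients(meta_loss, [initial_params])
```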

Could anyone help us resolve this?

tensorflowbutler commented 4 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

- What is the top-level directory of the model you are using
- Have I written custom code
- OS Platform and Distribution
- TensorFlow installed from
- TensorFlow version
- Bazel version
- CUDA/cuDNN version
- GPU model and memory
- Exact command to reproduce

djycn commented 4 years ago

- Model directory: tensorflow/models/research/learned_optimizer
- OS Platform and Distribution: Ubuntu 18.04
- TensorFlow version: v1.13.1, installed from conda
- Bazel version: N/A
- CUDA/cuDNN version: 10.0/7.3.1
- GPU model and memory: RTX 2080 Ti, 11 GB
- Exact command to reproduce: python metarun.py --use_second_derivatives=False

nirum commented 4 years ago

These kinds of memory issues are always a problem. When training learned optimizers using gradient-based optimization, there is always a tension between the length of the unroll (the number of steps in the inner training loop) and the amount of memory required. The longer the unroll, the more memory you need, because you have to backpropagate through the entire unrolled optimization process.

To mitigate this, you can either use short unrolls (just a few inner training steps), or use a training method that does not require gradients (that is, gradient-free optimization of the optimizer itself). For example, you can use something like Evolution Strategies (ES) to train learned optimizers in a more memory-efficient way (https://arxiv.org/abs/1810.10180); a rough sketch of the ES idea is below.
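A minimal, generic sketch of antithetic ES (not the code from the linked paper; the meta_loss function and hyperparameters here are placeholders): the meta-gradient is estimated from forward evaluations only, so nothing from the inner training loop has to be stored for backpropagation.

```python
import numpy as np

def meta_loss(optimizer_params):
    # Placeholder: run the inner training loop with these optimizer parameters
    # and return the final training loss. Only this scalar is needed.
    return np.sum(optimizer_params ** 2)

def es_gradient(theta, sigma=0.1, num_pairs=16):
    # Antithetic ES: estimate the gradient from paired +/- perturbations.
    grad = np.zeros_like(theta)
    for _ in range(num_pairs):
        eps = np.random.randn(*theta.shape)
        grad += (meta_loss(theta + sigma * eps) -
                 meta_loss(theta - sigma * eps)) / (2.0 * sigma) * eps
    return grad / num_pairs

theta = np.random.randn(10)              # learned-optimizer parameters (flattened)
for step in range(100):
    theta -= 0.05 * es_gradient(theta)   # meta-training without any backprop
```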

Unfortunately, we are no longer providing maintenance/support for this specific project. As you can imagine, if we supported all previous research projects we would quickly run out of time to do new research :)