silvandeleemput / memcnn

PyTorch Framework for Developing Memory Efficient Deep Invertible Networks

Memory demand not independent of depth because ctx.saved_tensors not freed? #53


cetmann commented 4 years ago

Hi! I did some benchmarking recently and found that the memory demand is not quite independent of the depth; see Table 3 on the last page of https://arxiv.org/abs/2005.05220.

My suspicion is that operations in the initial forward pass still save tensors via ctx.save_for_backward(..) that would normally be needed for backpropagation, and that these are then kept alive in ctx.saved_tensors. Can you confirm that this is what's happening, and if so, is there a way of freeing this memory as well? These saved tensors should never be needed when training with memory-efficient invertible backpropagation, because they are re-created when the activations are reconstructed in the backward pass.

Best, Christian
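To illustrate the mechanism I mean, here is a minimal toy sketch (not memcnn code): any custom autograd.Function that calls ctx.save_for_backward keeps those tensors alive until its backward runs or the graph is freed.

```python
import torch

class Square(torch.autograd.Function):
    """Toy example: ctx.save_for_backward keeps `x` alive until
    backward() runs (or until the autograd graph is freed)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # x stays referenced via the graph
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors  # retrieved here; released after backward
        return 2 * x * grad_output
```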

silvandeleemput commented 4 years ago

@cetmann Hi, thanks for letting me know. I am currently on vacation, but I'll have a look at it once I am back.

silvandeleemput commented 4 years ago

Hi @cetmann, I have had a look at the table in your paper. Memory consumption during training is a bit complicated, both to explain and to measure, so it requires some elaboration.

As you have already identified, the memory savings achieved by the MemCNN couplings do not apply to the model parameters (there are no memory savings when increasing model depth, unless you use weight sharing or something similar), but only to the activations of the feature maps during training. For the activations, only the last one has to be stored, so their memory footprint becomes independent of depth, with memory complexity O(1). As you noted, the activations still account for the majority of the memory used during training, so this should result in significant memory savings. In conclusion, the memory consumption of the couplings during training is independent of model depth only for the activations (since you need not store them) and not (necessarily) for the parameters.
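To make the distinction concrete, here is a minimal sketch following the AdditiveCoupling / InvertibleModuleWrapper usage from the memcnn README (the SubNet module and channel sizes are arbitrary placeholders):

```python
import torch.nn as nn
import memcnn

class SubNet(nn.Module):
    """Small sub-network used for the F and G functions of a coupling."""
    def __init__(self, channels):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.seq(x)

# an additive coupling splits the 10 input channels into two halves of 5
coupling = memcnn.AdditiveCoupling(Fm=SubNet(5), Gm=SubNet(5))

# keep_input=False lets memcnn free the input activation after the forward
# pass; it is reconstructed from the output during the backward pass, which
# is what makes the stored-activation memory O(1) in depth
block = memcnn.InvertibleModuleWrapper(fn=coupling, keep_input=False)
```

Stacking many such blocks in an nn.Sequential makes the parameter memory grow linearly with depth, while the stored-activation memory stays roughly constant.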

Measuring the memory usage of the activations and of the model parameters can be done as follows:
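A minimal sketch of such a measurement (the names parameter_bytes, activation_bytes, model, and x are placeholders, not memcnn API; assumes a CUDA device):

```python
import torch

def parameter_bytes(model: torch.nn.Module) -> int:
    # memory held by the parameters themselves; grows linearly with depth
    return sum(p.numel() * p.element_size() for p in model.parameters())

def activation_bytes(model: torch.nn.Module, x: torch.Tensor) -> int:
    # additional memory still allocated after a forward pass, which is
    # dominated by the activations stored for the backward pass
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    y = model(x)  # keep y referenced while measuring
    torch.cuda.synchronize()
    return torch.cuda.memory_allocated() - before
```

With invertible couplings and keep_input=False, activation_bytes should stay roughly flat as depth grows, while parameter_bytes keeps growing.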

Also, measuring memory consumption in PyTorch is tricky. Don't rely on nvidia-smi for the statistics: PyTorch uses a caching allocator, so nvidia-smi will often show more memory in use than is actually allocated. Instead, use torch.cuda.memory_allocated in your code to get the memory actually allocated on a GPU device. See also https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
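A small sketch of the caching behaviour (nothing memcnn-specific; torch.cuda.memory_reserved reports the cache the allocator holds on to):

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
del x  # returned to PyTorch's caching allocator, not to the driver

print(torch.cuda.memory_allocated())  # actual tensor memory: back to ~0
print(torch.cuda.memory_reserved())   # cached memory nvidia-smi still sees
```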

I hope this clarifies your findings. If you still have questions, please let me know.