jettify / pytorch-optimizer

torch-optimizer -- collection of optimizers for Pytorch
Apache License 2.0

How to stop memory leak while using adahessian? #540

Open Shubh-Goyal-07 opened 2 months ago

VadisettyRahul commented 1 week ago
  1. Use torch.no_grad() Where Applicable: Disable gradient tracking when it is not needed, such as during validation or inference, so that activations are not stored for a backward pass.

    with torch.no_grad():
        outputs = model(inputs)  # placeholder validation/inference code; no graph is built
  2. Delete Unused Variables: Remove intermediate tensors that are no longer needed with Python’s del statement, then clear the CUDA cache if you are on GPU. Note that empty_cache() only releases cached blocks that are no longer referenced; it cannot free tensors you still hold.

    del variable_name
    torch.cuda.empty_cache()  # releases cached, unreferenced GPU memory
  3. Enable Gradient Checkpointing: This reduces memory consumption by recomputing parts of the graph during the backward pass instead of storing all intermediate activations, which is useful for large models.

    from torch.utils.checkpoint import checkpoint

    # Recompute the activations of `model` during backward instead of caching them
    output = checkpoint(model, input_data)
  4. Optimize Batch Size: Larger batches consume more activation memory per step, so reducing the batch size is a quick way to avoid out-of-memory errors (see the first sketch after this list).

  5. Detach Unnecessary Tensors: Use detach() so PyTorch does not retain the computation graph for tensors that no longer need gradient tracking; this matters most for values you keep around across iterations (see the second sketch after this list).

    tensor = tensor.detach()

  6. Use torch.cuda.empty_cache() Regularly: During GPU training, periodically clearing the caching allocator releases unused cached blocks back to the device; it will not, however, free tensors that are still referenced.

    torch.cuda.empty_cache()

  7. Monitor Memory Usage: Use PyTorch's built-in memory reporting to track GPU usage during training and spot steady growth.

    import torch
    print(torch.cuda.memory_summary())
  8. Check for Redundant Hessian Calculations: AdaHessian relies on second-order information, and the extra graph built by backward(create_graph=True) is the most memory-intensive part of each step; make sure it is not recomputed needlessly or kept alive between iterations, for example by stored losses or gradients (see the last sketch below).
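
To make a few of the points above concrete, here is a minimal sketch for point 4. It assumes a train_dataset object already exists; the batch size of 16 is only an illustration, not a recommendation:

    from torch.utils.data import DataLoader

    # A smaller batch size directly reduces per-step activation memory.
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)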
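
For point 5, the most common source of steadily growing memory is accumulating a loss tensor that still carries its graph. A minimal sketch, assuming model, criterion, optimizer, and train_loader are already defined:

    running_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Accumulate a plain float: loss.item() (or loss.detach()) drops the graph.
        # `running_loss += loss` would keep every iteration's graph alive and
        # memory would grow each step.
        running_loss += loss.item()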
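
Finally, for point 8 and the original question, a sketch of an AdaHessian training step that keeps per-step memory bounded, assuming the Adahessian class from this repo (torch_optimizer) and the same placeholder model, criterion, and train_loader as above. This is a usage pattern to check against, not a guaranteed fix for the leak:

    import torch_optimizer as optim

    optimizer = optim.Adahessian(model.parameters(), lr=0.1)

    for inputs, targets in train_loader:
        optimizer.zero_grad(set_to_none=True)  # drops old grads and their graphs
        loss = criterion(model(inputs), targets)
        # AdaHessian needs a second-order graph for its Hessian-trace estimate.
        loss.backward(create_graph=True)
        optimizer.step()
        log_value = loss.item()  # keep only a float; do not store the loss tensor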