Use torch.no_grad() Where Applicable: Ensure that gradient calculations are disabled when not needed, such as during validation or inference, to save memory.
with torch.no_grad():
    # Validation or inference code goes here
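As a fuller sketch, here is a hedged validation pass; the model, data, and loss below are illustrative stand-ins, not part of any specific codebase. Pairing model.eval() with torch.no_grad() is the usual pattern:

import torch

model = torch.nn.Linear(10, 1)                          # illustrative model
val_x, val_y = torch.randn(64, 10), torch.randn(64, 1)  # illustrative data

model.eval()                     # switch off dropout/batch-norm updates
with torch.no_grad():            # no autograd graph is built, so
    preds = model(val_x)         # activations are freed immediately
    val_loss = torch.nn.functional.mse_loss(preds, val_y).item()
print(val_loss)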
Delete Unused Variables: Remove any intermediate tensors or variables that are no longer needed. This can be done with Python’s del statement followed by clearing the GPU cache if using CUDA.
del variable_name
torch.cuda.empty_cache()  # Releases cached, unused blocks back to the GPU; does not free tensors still referenced
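As a concrete hedged example (this assumes a CUDA device is available; the tensor sizes are arbitrary):

import torch

a = torch.randn(4096, 4096, device="cuda")  # assumes a CUDA device
b = a @ a                                   # large intermediate tensor
result = b.sum().item()
del a, b                                    # drop the Python references
torch.cuda.empty_cache()                    # hand the freed blocks back to the driver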
Enable Gradient Checkpointing: This reduces memory consumption by recomputing parts of the graph during the backward pass rather than storing all intermediate activations. Useful when dealing with large models.
from torch.utils.checkpoint import checkpoint
# Gradient checkpointing: activations inside `model` are recomputed on backward
output = checkpoint(model, input_data, use_reentrant=False)
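A runnable hedged sketch with a hypothetical two-stage model; only the checkpointed first stage has its activations recomputed:

import torch
from torch.utils.checkpoint import checkpoint

stage1 = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())  # illustrative
stage2 = torch.nn.Linear(64, 1)

x = torch.randn(8, 64, requires_grad=True)
h = checkpoint(stage1, x, use_reentrant=False)  # stage1 activations are not stored
out = stage2(h).sum()
out.backward()                                  # stage1 is re-run here to get grads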
Optimize Batch Size: Large batch sizes consume more memory. Reducing the batch size helps prevent memory overflows.
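For illustration, a hedged sketch (the dataset is a stand-in): dropping the DataLoader batch size shrinks the per-step activation footprint at some cost in throughput.

import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # stand-in
# A batch size of 256 may overflow GPU memory on a large model;
# 64 stores a quarter of the activations per step.
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)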
Detach Unnecessary Tensors: Use detach() to prevent PyTorch from retaining computation graphs for tensors that no longer require gradient tracking.
tensor = tensor.detach()  # returns a view that shares storage but drops the graph
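This matters most when accumulating metrics across steps. A hedged sketch of the classic leak it prevents (model, optimizer, and data are illustrative):

import torch

model = torch.nn.Linear(10, 1)                        # illustrative
opt = torch.optim.SGD(model.parameters(), lr=0.01)
running_loss = 0.0
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # `running_loss += loss` would keep every iteration's graph alive;
    # detach() (or loss.item()) lets each graph be freed.
    running_loss += loss.detach()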
Use torch.cuda.empty_cache() Regularly: In GPU workloads, periodically clearing the cache releases unused cached memory back to the GPU so other allocations or processes can use it. Note that it does not free tensors that are still referenced, and calling it every step adds overhead.
torch.cuda.empty_cache()
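A hedged sketch of what "periodically" can look like in practice; the interval of 100 steps is arbitrary:

import torch

for step in range(1000):
    # ... one forward/backward/optimizer step goes here (omitted) ...
    if step % 100 == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver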
Monitor Memory Usage: Use PyTorch's CUDA memory inspection utilities to track how much memory training actually consumes.
import torch
print(torch.cuda.memory_allocated())   # bytes currently occupied by tensors
print(torch.cuda.memory_reserved())    # bytes held by the caching allocator
print(torch.cuda.memory_summary())     # detailed human-readable report
Check for Redundant Hessian Calculations: AdaHessian approximates the Hessian diagonal with Hutchinson's method, which requires a second backward pass through a graph built with create_graph=True. Make sure this estimate is computed once per optimizer step and not needlessly repeated, since the retained graph is memory-intensive.
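A hedged, self-contained sketch of a single Hutchinson probe, the estimator AdaHessian builds on; the model and data are illustrative stand-ins:

import torch

model = torch.nn.Linear(10, 1)                                  # illustrative
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
# create_graph=True retains the graph so the gradients can be
# differentiated again; do this once per step, not once per group.
grads = torch.autograd.grad(loss, params, create_graph=True)

# Rademacher probe z; diag(H) is approximated by z * (Hz).
z = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
hvps = torch.autograd.grad(grads, params, grad_outputs=z)
hess_diag = [zi * hi for zi, hi in zip(z, hvps)]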