nwoyecid opened this issue 1 year ago
This error occurs only during inference/evaluation and not during training.
Inference works fine for me. Can you share your code?
There could be many factors behind the OOM error, e.g. using beam search in NLP models or large batch sizes for evaluation. Memory fragmentation can also lead to inefficient use of GPU memory. Try this:
import torch

# Cap this process's share of GPU memory (adjust the fraction as needed);
# this helps when sharing a GPU, but does not prevent OOM within the cap
torch.cuda.set_per_process_memory_fraction(0.9)
# Allow TF32 matmuls on Ampere+ GPUs (speeds up matmul; does not reduce memory use)
torch.backends.cuda.matmul.allow_tf32 = True
Or you can try clearing the allocator's cache before starting inference:
import torch

# Release cached, unused blocks back to the GPU driver (does not free live tensors)
torch.cuda.empty_cache()
Please provide the code for more detailed help. Thanks!
I get an OOM error during inference but not during training.
This happens even with a batch size of 1, and even after increasing the GPU memory.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.25 GiB (GPU 0; 44.42 GiB total capacity; 36.96 GiB already allocated; 3.95 GiB free; 38.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
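In that traceback, reserved memory (38.83 GiB) is well above allocated (36.96 GiB), so the allocator hint in the message may apply. A minimal sketch of setting max_split_size_mb as the message suggests, assuming it is set before the first CUDA allocation in the process (the 128 MB value is an illustrative assumption to experiment with, not a recommendation):

import os

# The caching allocator reads PYTORCH_CUDA_ALLOC_CONF before the first CUDA
# allocation, so set it at the top of the script (or in the shell environment)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the env var so the setting is picked up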
I think it's not really an OOM issue; something in the Trainer keeps increasing the reserved memory.
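For what it's worth, a common cause of OOM that appears only at inference time is running the forward pass with autograd enabled, so activations are kept alive for a backward pass that never happens. A minimal sketch of an eval loop with autograd off, where model and eval_loader are placeholders for your own objects:

import torch

model.eval()  # switch off dropout / batch-norm updates
with torch.inference_mode():  # no autograd graph, so activations are freed immediately
    for batch in eval_loader:  # placeholder DataLoader over your eval set
        batch = batch.to("cuda", non_blocking=True)
        outputs = model(batch)
        # Move results to CPU promptly so predictions do not accumulate on the GPU
        preds = outputs.cpu()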