shansongliu / MU-LLaMA

MU-LLaMA: Music Understanding Large Language Model
GNU General Public License v3.0

Cannot run inference on an A40 GPU due to CUDA OOM error #12

Closed dlion168 closed 11 months ago

dlion168 commented 11 months ago

Many thanks for your great work! I tried to run inference with the MU-LLaMA model on an A40 GPU (48 GB VRAM) using the provided Python script, but it fails with the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.27 GiB (GPU 0; 44.35 GiB total capacity; 30.87 GiB already allocated; 13.06 GiB free; 30.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried to (1) set PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128" and (2) decrease the LLaMA parameters in llama_adaptor.py from max_batch_size=32 to 1 and max_seq_len from 8192 to 256, but the error still occurs.
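For reference, a minimal sketch of how those two workarounds could be applied in one place, assuming the allocator setting is exported before PyTorch makes its first CUDA allocation; the model-loading step is only a placeholder for the repository's actual inference script:

```python
# Minimal sketch of the attempted workarounds (illustrative only).
# The allocator config must be set before the first CUDA allocation,
# so it is placed before "import torch".
import os

# (1) Cap allocator block splits to reduce fragmentation-related OOMs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

def report_gpu_memory(tag: str) -> None:
    """Print free/total GPU memory to see how close inference is to the limit."""
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"[{tag}] free: {free_b / 2**30:.2f} GiB / total: {total_b / 2**30:.2f} GiB")

report_gpu_memory("before model load")
# (2) Load MU-LLaMA here with reduced max_batch_size / max_seq_len
#     (placeholder for the repository's own loading code).
report_gpu_memory("after model load")
```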

crypto-code commented 11 months ago

I think the issue is that the music file you are using is too large for inference to run on the GPU. On a 32 GB GPU, the limit I have tested is about 1 minute of audio.

If you want, I can add another Gradio demo that handles larger music files by using a sliding window for inference.
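For context, a sliding-window approach splits the audio into fixed-length chunks and runs inference on each chunk separately, so each forward pass stays within GPU memory. A rough sketch, assuming a hypothetical run_inference(waveform, sample_rate, prompt) helper that wraps the existing single-clip pipeline (window and hop sizes are illustrative, not the repository's API):

```python
# Sliding-window inference over a long audio file (illustrative sketch).
import torchaudio

def sliding_window_inference(audio_path, prompt, run_inference,
                             window_sec=60.0, hop_sec=30.0):
    waveform, sr = torchaudio.load(audio_path)  # (channels, samples)
    window = int(window_sec * sr)
    hop = int(hop_sec * sr)

    answers = []
    for start in range(0, waveform.shape[1], hop):
        chunk = waveform[:, start:start + window]
        if chunk.shape[1] == 0:
            break
        # Each chunk fits in GPU memory on its own, avoiding the OOM on long files.
        answers.append(run_inference(chunk, sr, prompt))
        if start + window >= waveform.shape[1]:
            break
    return answers
```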