lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

CUDA out of memory in CLI vicuna 7B #657

Open mpetruc opened 1 year ago

mpetruc commented 1 year ago

Running inference using vicuna 7B on a 16 GB 3080. Occasionally the script crashes with an error like:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 16.00 GiB total capacity; 13.69 GiB already allocated; 0 bytes free; 13.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

I modified modeling_llama.py by adding:

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:2000'

I also tried 'max_split_size_mb:4000'.

Any suggestions for addressing this issue? Thank you.
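
Note for later readers: PYTORCH_CUDA_ALLOC_CONF is read when PyTorch initializes its CUDA caching allocator, so the most reliable place to set it is in the environment before launching the process, rather than inside library code. A minimal sketch, with an illustrative split size (smaller values such as 128 are usually what helps against fragmentation, not larger ones) and a placeholder model path:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python3 -m fastchat.serve.cli --model-path /path/to/vicuna-7b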

Chesterguan commented 1 year ago

Have you tried with --load-8bit? Even though the GPU has 16 GB of memory, you can typically only use around 85% of it.
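
For rough context on the numbers: 7B parameters at fp16 is about 7e9 × 2 bytes ≈ 13 GiB for the weights alone, before activations and the KV cache, so a 16 GB card is already tight; at 8 bits the weights drop to roughly 6.5 GiB, which is why --load-8bit usually helps here.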

ivangabriele commented 1 year ago

I'm trying to run FastChat in a CUDA Docker image and I hit the same issue with an 8 GB RTX 2070:

OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 7.78 GiB total capacity; 6.31 GiB already allocated; 62.44 MiB free; 6.31 GiB reserved in total by PyTorch) If reserved memory is >> 
allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried running it with --load-8bit but I get the same error. Command:

python3 -m fastchat.serve.cli --load-8bit --model-path /app/models/vicuna-7b
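
On an 8 GB card, even the 8-bit weights (roughly 6.5 GiB by the arithmetic above) leave little headroom for activations and the KV cache, so an OOM with --load-8bit is plausible there. Depending on the FastChat version there may be further memory-reduction options (e.g. CPU offloading); python3 -m fastchat.serve.cli --help lists what your install supports.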
surak commented 1 year ago

@mpetruc this looks like some other process took over GPU memory. Did you check with nvidia-smi if there was something there?

Is it still an issue for you?
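
For the nvidia-smi check surak suggests, a standard invocation that lists the processes currently holding GPU memory:

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv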