mpetruc opened this issue 1 year ago
Have you tried with --load-8bit? Even though the GPU has 16 GB of memory, you can only use around 85% of it.
I'm trying to run FastChat in a CUDA Docker image and I have the same issue with an RTX 2070 8 GB:
OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 7.78 GiB total capacity; 6.31 GiB already allocated; 62.44 MiB free; 6.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I tried to run it with --load-8bit, but I get the same error. The command was:
python3 -m fastchat.serve.cli --load-8bit --model-path /app/models/vicuna-7b
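For comparison, here is a minimal sketch of loading a 7B checkpoint in 8-bit directly through transformers + bitsandbytes. This is not necessarily the code path FastChat's --load-8bit flag takes internally; it assumes bitsandbytes and accelerate are installed, and the model path is simply reused from the command above.

```python
# Sketch: load Vicuna-7B with 8-bit quantization via transformers + bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/app/models/vicuna-7b"  # path taken from the command above

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # quantize linear layers to int8 via bitsandbytes
    device_map="auto",   # let accelerate decide layer placement
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```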
@mpetruc this looks like some other process has taken over GPU memory. Did you check with nvidia-smi whether something else was running there?
Is it still an issue for you?
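If nvidia-smi isn't handy (e.g. inside the container), a quick sketch for checking from Python what this process can actually see; all of these are standard torch.cuda calls:

```python
# Sketch: report free/total GPU memory and what this process has allocated/reserved.
import torch

free, total = torch.cuda.mem_get_info()  # bytes free/total on the current device
print(f"free:      {free / 1024**3:.2f} GiB")
print(f"total:     {total / 1024**3:.2f} GiB")
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")  # tensors held by PyTorch
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")   # cached by the allocator
```

If total minus free is much larger than allocated plus reserved, some other process is holding the memory.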
Running inference with Vicuna-7B on a 16 GB 3080. Occasionally the script crashes with an error like:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 16.00 GiB total capacity; 13.69 GiB already allocated; 0 bytes free; 13.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
I modified modelling_llama.py by adding import os and os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:2000'; I also tried 'max_split_size_mb:4000'.
Any suggestions for addressing this issue? Thank you.
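One thing worth noting: PYTORCH_CUDA_ALLOC_CONF is only honoured if it is set before the CUDA caching allocator initializes, so dropping it into modelling_llama.py may or may not take effect depending on when that module is imported. A safer route is to export it in the shell before launching (export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128) or to set it at the very top of the entry script. Also, commonly suggested values are much smaller than 2000 to 4000 MB (e.g. 128 or 512). A minimal sketch, where 128 is illustrative rather than a recommendation for your exact setup:

```python
# Sketch: set the allocator config before torch initializes CUDA.
# Put this at the very top of the entry-point script; the option is only
# honoured if CUDA has not been initialized yet.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var is set

# ... load the model / run inference as usual ...
```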