stikkireddy opened this issue 7 months ago · Open
I had a similar issue before. You can try setting gpu_memory_utilization to a lower value like 0.5; the default value is 0.9.
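For reference, a minimal sketch of lowering that value through the vLLM Python API (the model name below is just a placeholder):

```python
from vllm import LLM

# Reserve only 50% of GPU memory for the weights and KV cache
# instead of the default 0.9.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model path
    gpu_memory_utilization=0.5,
)
```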
The model requires 2 GPUs to run; it's Llama 70B in fp16, and I need the actors to be able to shard it across the two GPUs. The problem is less the OOM itself; the problem is that two GPUs are available and it is not using both.
I am unsure why gpu_memory_utilization would fix this, since when running on a single node it works just fine with tensor-parallel-size.
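For context, the single-node setup that works looks roughly like this sketch (vLLM Python API assumed; the model path is illustrative):

```python
from vllm import LLM

# Shard the fp16 70B model across both A100s on the node.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model path
    tensor_parallel_size=2,             # one shard per GPU
    dtype="float16",
)
```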
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
I am running Llama 70B and I want to deploy multiple model instances based on the number of Ray worker nodes, and I am getting this issue. I am using the examples provided in the Ray repo; they work with 1 GPU, but 2 x A100s are not working. Can you please assist with this? It does not seem to be using the 2 GPUs, and I confirmed with the commented-out code that there actually are two GPUs.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 369.25 MiB is free. Process 86114 has 78.78 GiB memory in use. Of the allocated memory 78.15 GiB is allocated by PyTorch, and 1.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
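For reference, a minimal sketch of the kind of visibility check mentioned above ("confirmed ... there are actually two GPUs"), assuming a plain Ray task that requests both A100s; the function name and returned keys are illustrative:

```python
import ray
import torch

@ray.remote(num_gpus=2)
def check_gpus():
    # Report what a 2-GPU Ray task actually sees. If torch only reports
    # one device here, an engine launched in the same placement will also
    # only see one GPU regardless of tensor_parallel_size.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "torch_device_count": torch.cuda.device_count(),
    }

ray.init()
print(ray.get(check_gpus.remote()))
```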