Closed by rooooc 2 weeks ago
Hello @rooooc!
Your issue probably relates to not setting max-batch-total-tokens
(https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#maxbatchtotaltokens). Setting max-total-tokens and max-batch-prefill-tokens alone does not control the maximum number of tokens that can be batched together, and that is what determines the total GPU memory that can be used.
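For illustration, a launcher invocation that sets the batch budget explicitly might look like the sketch below. The numeric values are hypothetical placeholders, not tuned recommendations; the point is that only --max-batch-total-tokens caps the token count summed across all concurrently batched requests:

```shell
# Hypothetical example values, not recommendations.
# --max-total-tokens limits a single request (input + output tokens);
# --max-batch-prefill-tokens limits tokens in one prefill pass;
# --max-batch-total-tokens limits the sum of tokens across the whole
#   running batch, which is what actually bounds KV-cache GPU memory.
text-generation-launcher \
  --model-id /models/ \
  --max-input-length 1024 \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 2048 \
  --max-batch-total-tokens 16384
```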
OK, I got it. But why does it work fine when max-total-tokens is large, like 2048, yet fail when it is 64? I am not setting max-batch-total-tokens in either case.
I have set max-batch-total-tokens, and it is still not working.
@rooooc, you should be able to reduce max-batch-total-tokens until you reach a value that fits your GPU memory. As stated in the docs:
Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded).
If you OOM, it should be reduced further.
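One way to follow that advice is to retry startup with a smaller batch budget each time the server OOMs. A minimal sketch, where the starting value is an arbitrary assumption rather than a recommendation:

```shell
# Sketch: if startup OOMs, halve --max-batch-total-tokens and retry.
# 16384 is an arbitrary starting point; keep reducing until the server
# starts without a CUDA out-of-memory error.
text-generation-launcher \
  --model-id /models/ \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 2048 \
  --max-batch-total-tokens 16384   # try 8192, then 4096, ... on OOM
```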
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
When I try to run CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 64 --max-total-tokens 128 --max-batch-prefill-tokens 128 --cuda-memory-fraction 0.95, it says:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU has a total capacity of 44.53 GiB of which 1.94 MiB is free. Process 123210 has 44.52 GiB memory in use. Of the allocated memory 40.92 GiB is allocated by PyTorch, and 754.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
But when setting large max tokens, CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 2048 --cuda-memory-fraction 0.95 works fine.
I don't get why small max-token settings cause CUDA out of memory while large ones work fine. Can someone answer my question?