huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Can't start server with small --max-total-tokens, but works fine with a big setting #2246

Closed rooooc closed 2 weeks ago

rooooc commented 1 month ago

When I try to run CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 64 --max-total-tokens 128 --max-batch-prefill-tokens 128 --cuda-memory-fraction 0.95, it says:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU has a total capacity of 44.53 GiB of which 1.94 MiB is free. Process 123210 has 44.52 GiB memory in use. Of the allocated memory 40.92 GiB is allocated by PyTorch, and 754.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

But when setting big max tokens, CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 2048 --cuda-memory-fraction 0.95 works fine.

I don't understand why small max tokens causes CUDA out of memory while large max tokens works fine. Can someone explain?

Hugoch commented 1 month ago

Hello @rooooc!

Your issue probably relates to not setting max-batch-total-tokens (https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#maxbatchtotaltokens). Setting only max-total-tokens and max-batch-prefill-tokens does not control the maximum number of tokens that can be batched together, and that number is what determines the total GPU memory that will be used.
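
For illustration, a minimal sketch of the failing launch command with an explicit cap added. The 8192 value is only a placeholder, not a recommendation; tune it to whatever fits the memory remaining after the model is loaded:

CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 64 --max-total-tokens 128 --max-batch-prefill-tokens 128 --max-batch-total-tokens 8192 --cuda-memory-fraction 0.95  # 8192 is an illustrative placeholder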

rooooc commented 1 month ago

OK, I got it. But why does it work when max-total-tokens is large, like 2048, yet fail when it is 64? I am not setting max-batch-total-tokens in either case.

rooooc commented 1 month ago

I have set max-batch-total-tokens and it is still not working.

Hugoch commented 1 month ago

@rooooc, you should be able to reduce max-batch-total-tokens until you reach a value that fits within your GPU memory. As stated in the docs:

Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded).

If you OOM, it should be reduced further.
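
For example (illustrative values only, reusing the launch flags from the original report), one could step the cap down until warmup succeeds:

CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 64 --max-total-tokens 128 --max-batch-prefill-tokens 128 --max-batch-total-tokens 4096 --cuda-memory-fraction 0.95  # 4096 is a guess; halve it again if this still OOMs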

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.