When testing on vLLM, did you try `--max-model-len 512`? It looks from your output that it went to 16384.
I've also added `--max-model-len=512` when testing on vLLM:

```bash
python -m vllm.entrypoints.openai.api_server --model ./CodeLlama-7B-AWQ --max-model-len=512 -q awq --dtype half
```

It outputs the same as the tests above and occupies the same VRAM: about 6.3 GB / 12.0 GB. It then throws an error, with VRAM usage dropping afterwards:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 6.93 GiB already allocated; 3.09 GiB free; 6.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
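For a rough sense of why the context length matters so much here, a back-of-the-envelope calculation, assuming the usual Llama-7B geometry (32 layers, hidden size 4096, fp16 KV cache); these are estimates, not vLLM's exact accounting:

```python
# Rough KV-cache cost for a Llama-7B-shaped model (assumed geometry, fp16 cache).
num_layers = 32
hidden_size = 4096           # 32 heads x head_dim 128
bytes_per_elem = 2           # fp16
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem  # keys + values
print(f"{kv_bytes_per_token / 2**20:.2f} MiB per token")            # ~0.50 MiB

for max_len in (512, 16384):
    gib = kv_bytes_per_token * max_len / 2**30
    print(f"max_model_len={max_len}: ~{gib:.2f} GiB of KV cache per full-length sequence")
# 512 -> ~0.25 GiB, 16384 -> ~8 GiB, on top of the ~4 GiB of AWQ weights;
# vLLM also preallocates its KV-cache pool up front based on --gpu-memory-utilization.
```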
Isn't it `--max_model_len`, or am I mistaken? Btw, the 7B model should definitely fit into a 512 context.
Yes, `--max-model-len` is passed from the CLI but converted into `--max-seq-len` in the vLLM API. You can verify this and try it on your vLLM. I've tried 512 many times but it won't work...
I've re-tested this on an A100 instead of the RTX 3060, and it turns out it occupies about 20+ GB of VRAM! Why is that? I use the command:
python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code
That was so weird...
Seconding @bonuschild's error output: trying to run Mistral-7B on a T4 with 16 GB VRAM after I've quantised it with AWQ still causes CUDA OOM errors.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
llm = LLM(model="zephyr-7b-beta-awq", quantization="awq", dtype="float16")
outputs = llm.generate(prompts, sampling_params)  # prompts: list of input strings
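In case it helps while debugging, here is the same snippet with the memory-limiting knobs that come up later in this thread; the specific values (`max_model_len=4096`, `gpu_memory_utilization=0.8`) are guesses for a 16 GB T4, not verified settings:

```python
from vllm import LLM, SamplingParams

# Same zephyr AWQ model, but cap the context length and the VRAM fraction vLLM may claim.
llm = LLM(
    model="zephyr-7b-beta-awq",
    quantization="awq",
    dtype="float16",
    max_model_len=4096,          # assumed cap; the checkpoint's default may be 32768
    gpu_memory_utilization=0.8,  # leave headroom below vLLM's 0.90 default
)
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
prompts = ["def quicksort(arr):"]  # placeholder prompt
outputs = llm.generate(prompts, sampling_params)
```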
@amir-in-a-cynch @casper-hansen @SupreethRao99 @tmm1
I use the AWQ model made by @TheBloke from https://huggingface.co/TheBloke/CodeLlama-7B-AWQ
Following its instructions, I should run this command:
python -m vllm.entrypoints.api_server --model TheBloke/CodeLlama-7B-AWQ --quantization awq
but it raised a datatype-not-supported error:
ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]
So I changed the command and reran:
python -m vllm.entrypoints.openai.api_server --model path/to/CodeLlama-7B-AWQ -q awq --dtype half
and it raised an OOM error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 12.00 GiB total capacity; 3.38 GiB already allocated; 6.99 GiB free; 3.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This time it really seems there is not enough VRAM...
So why does a ~4 GB AWQ model require more than 12 GB of VRAM to run?
Can confirm this is an issue; tried on an A100 with normal vLLM (no fork) and faced the same issue.
Same issue
Try adding `--gpu-memory-utilization 0.8`; this worked for me.
I think I found a potential issue and solution. This is specifically because of how vLLM works.
Setting 'max_batch_tokens' (I think that's the name) too high causes the KV cache to be too big. It directly influences the GPU memory occupied for some reason. Try setting your max_batch_tokens to something like 32k while keeping everything else the same.
This fixed it for me.
@slobodaapl, could you be more exact about the parameter that should be changed? I can't find anything similar to max_batch_tokens in the arguments for SamplingParams or the LLM class.
Thank you!
@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:
python -m vllm.entrypoints.openai.api_server \
--served-model $MODEL_ID \
--model $MODEL_ID \
--tensor-parallel-size 4 \
--host 0.0.0.0 \
--port 8080 \
--max-num-batched-tokens 32768
When I reduced `--max-num-batched-tokens` down to 32768 from the high number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.
I used the command as you provided, but it still costs about 21 GB of VRAM when loading a 7B AWQ model :(
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_ID \
--host 0.0.0.0 \
--port 8080 \
--max-num-batched-tokens 32768 \
-q awq --dtype half --trust-remote-code
For those with very limited VRAM, try setting the batched tokens to about 4-8k, and combine it with the memory-limit parameter set to about 0.8.
Also, try the non-quantised version with this first. It seems vLLM uses extra memory to do some kind of operation on the model when loading it quantised.
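If you want to see where the memory actually goes during loading, a quick way is to read the device's free/total memory around the `LLM` construction; a sketch assuming a single GPU, with the CodeLlama checkpoint from this thread as a stand-in:

```python
import torch
from vllm import LLM

def report(stage: str) -> None:
    # Free/total device memory as seen by CUDA, in GiB.
    free, total = torch.cuda.mem_get_info()
    print(f"{stage}: {(total - free) / 2**30:.2f} / {total / 2**30:.2f} GiB used")

report("before load")
llm = LLM(model="TheBloke/CodeLlama-7B-AWQ", quantization="awq",
          dtype="float16", gpu_memory_utilization=0.8)
report("after load")  # includes the weights plus vLLM's preallocated KV-cache pool
```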
Besides `--max-num-batched-tokens` and `--gpu-memory-utilization`, I also limit `--max-num-seqs`.
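For anyone using offline inference instead of the API server, the same flags exist as engine arguments; a minimal sketch, assuming the `LLM` constructor forwards these keyword arguments to the engine (worth double-checking against your vLLM version):

```python
from vllm import LLM

# Offline equivalent of the server flags discussed above.
llm = LLM(
    model="TheBloke/CodeLlama-7B-AWQ",
    quantization="awq",
    dtype="float16",
    max_model_len=4096,           # --max-model-len
    max_num_batched_tokens=4096,  # --max-num-batched-tokens
    max_num_seqs=16,              # --max-num-seqs
    gpu_memory_utilization=0.8,   # --gpu-memory-utilization
)
```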
For anybody stumbling here, be sure to check `max_seq_len`; for some reason the default was 32768 in TheBloke/zephyr-7B-beta-AWQ.
@demegire Agreed, and you need to find out the correct max sequence length, which is normally not mentioned in the model card :)
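One way to find the default before launching the server is to read it straight from the checkpoint's config; a small sketch using `transformers` (the attribute name `max_position_embeddings` is what Llama/Mistral-style configs use):

```python
from transformers import AutoConfig

# The context length vLLM falls back to when --max-model-len is not given.
config = AutoConfig.from_pretrained("TheBloke/zephyr-7B-beta-AWQ")
print(config.max_position_embeddings)  # 32768 for this checkpoint, per the comment above
```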
!python -u -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --dtype half \
  --max-num-batched-tokens 4096 \
  --max-model-len 256 \
  --quantization awq \
  --tensor-parallel-size 1 \
  --port 8010 | grep -q "Uvicorn running" &
npx localtunnel --port 8010
This worked on a T4 in Colab with CUDA 12.2.
Does anyone have an idea about running Ragas evaluation against a vLLM server for Hugging Face models? I was successfully serving my model, but the Ragas metrics evaluation does not recognise the vLLM serving and always asks for an OpenAI API key.
from ragas import evaluate
from ragas.metrics import faithfulness
# showing only 5 for demonstration
result = evaluate(fiqa_eval["baseline"].select(range(5)), metrics=[faithfulness])
result
OpenAIKeyNotFound: OpenAI API key not found! Seems like your trying to use Ragas metrics with OpenAI endpoints. Please set 'OPENAI_API_KEY' environment variable
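This is more a Ragas question than a vLLM one, but since vLLM's `openai.api_server` speaks the OpenAI protocol, a common workaround is to point the OpenAI client at it with a dummy key. Whether Ragas actually picks these variables up depends on your Ragas/openai versions, so treat this as an untested sketch:

```python
import os

# vLLM's OpenAI-compatible endpoint; the key only needs to be non-empty.
os.environ["OPENAI_API_KEY"] = "EMPTY"
os.environ["OPENAI_API_BASE"] = "http://localhost:8010/v1"  # older openai clients
os.environ["OPENAI_BASE_URL"] = "http://localhost:8010/v1"  # openai >= 1.0

from ragas import evaluate
from ragas.metrics import faithfulness

result = evaluate(
    fiqa_eval["baseline"].select(range(5)),  # fiqa_eval as in the snippet above
    metrics=[faithfulness],
)
```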
Test on llm-vscode-inference-server
I use the project llm-vscode-inference-server, which builds on vLLM, to load the model weights from CodeLlama-7B-AWQ with this command:
And output:
Then output after about 5 minutes:
I've set PYTORCH_CUDA_ALLOC_CONF via a command before executing the run command above, but still got the error:

Test on vllm
Simply change the command to:
and output:
then error:
System Resources Usage
Before I execute the command, my RTX 3060 VRAM usage is 1.5/12 GB; after executing it, usage rises to 6.0/12 GB, and after about 5 minutes it throws an OutOfMemoryError.

Question
I'm just confused why the AWQ model is only <= 4 GB in size but cannot run on an NVIDIA RTX 3060 with 12 GB of VRAM...