vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Running out of memory with TheBloke/CodeLlama-7B-AWQ #1479

Closed bonuschild closed 5 months ago

bonuschild commented 10 months ago

Test on llm-vscode-inference-server

I use the project llm-vscode-inference-server, which builds on vllm, to load the model weights from CodeLlama-7B-AWQ with the command:

python api_server.py --trust-remote-code --model ../CodeLlama-7B-AWQ --quantization awq --dtype half --max-model-len 512

And output:

WARNING 10-26 12:34:54 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:34:54 llm_engine.py:72] Initializing an LLM engine with config: model='../CodeLlama-7B-AWQ', tokenizer='../CodeLlama-7B-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:34:54 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

Then, after about 5 minutes, it outputs:

INFO 10-26 12:39:51 llm_engine.py:207] # GPU blocks: 793, # CPU blocks: 512
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 8.49 GiB already allocated; 1.53 GiB free; 8.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I had set PYTORCH_CUDA_ALLOC_CONF before executing the run command above, but I still got the error:

 set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100
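
(That is the Windows `set` syntax; on a Linux shell the equivalent should, as far as I know, be an export with the same value:)

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:100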

Test on vllm

I simply changed the command to:

python -m vllm.entrypoints.openai.api_server --model TheBloke/CodeLlama-7B-Python-AWQ --quantization awq --dtype half

Without --dtype half it raises an error like:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

With --dtype half, the output is:

WARNING 10-26 12:44:31 config.py:346] Casting torch.bfloat16 to torch.float16.
INFO 10-26 12:44:31 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-7B-AWQ/', tokenizer='./CodeLlama-7B-AWQ/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-26 12:44:31 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.

then error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 0; 12.00 GiB total capacity; 4.17 GiB already allocated; 5.94 GiB free; 4.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

System Resource Usage

Before I execute the command, my RTX 3060's VRAM usage is 1.5/12 GB. After executing it, usage rises to 6.0/12 GB, and then the OutOfMemoryError above is thrown after about 5 minutes.

Question

I'm just confused: the AWQ model is only about 4 GB, so why can't it run on an NVIDIA RTX 3060 with 12 GB of VRAM?

amir-in-a-cynch commented 10 months ago

When testing on vllm, did you try --max-model-len 512? It looks from your output like it went to 16384.

bonuschild commented 10 months ago
I've also added `--max-model-len=512` when testing on vllm:

python -m vllm.entrypoints.openai.api_server --model ./CodeLlama-7B-AWQ --max-model-len=512 -q awq --dtype half

It outputs the same as the tests above and occupies about the same VRAM: roughly 6.3 GB/12.0 GB. It then throws the error below while the VRAM usage drops back down:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 12.00 GiB total capacity; 6.93 GiB already allocated; 3.09 GiB free; 6.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

casper-hansen commented 10 months ago

Isn't it --max_model_len or am I mistaken? Btw, the 7B model should definitely fit into 512 context.

bonuschild commented 10 months ago
Yes, `--max-model-len` is passed from the CLI but is converted into `max_seq_len` in the vLLM API. You can verify this and try it on your vLLM. I've tried 512 many times, but it doesn't work...

bonuschild commented 10 months ago

Isn't it --max_model_len or am I mistaken? Btw, the 7B model should definitely fit into 512 context.

I've re-tested this on an A100 instead of the RTX 3060, and it ends up occupying about 20+ GB of VRAM! Why is that? I use the command:

python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code

That is so weird...

SupreethRao99 commented 10 months ago

Seconding @bonuschild's error output: trying to run Mistral-7B on a T4 with 16 GB VRAM, after I've quantised it with AWQ, still causes CUDA OOM errors.

from vllm import LLM, SamplingParams

# prompts is assumed to be a list of input strings defined elsewhere
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=2048)
llm = LLM(model="zephyr-7b-beta-awq", quantization="awq", dtype="float16")
outputs = llm.generate(prompts, sampling_params)

bonuschild commented 10 months ago

@amir-in-a-cynch @casper-hansen @SupreethRao99 @tmm1

I use the AWQ model made by @TheBloke from https://huggingface.co/TheBloke/CodeLlama-7B-AWQ

According to its instructions, I should run this command:

python -m vllm.entrypoints.api_server --model TheBloke/CodeLlama-7B-AWQ --quantization awq

but it raises a datatype-not-supported error:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

So I changed the command and reran:

python -m vllm.entrypoints.openai.api_server --model path/to/CodeLlama-7B-AWQ -q awq --dtype half

and it raises an OOM error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 12.00 GiB total capacity; 3.38 GiB already allocated; 6.99 GiB free; 3.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This time it seems there really isn't enough VRAM...

So why does a 4 GB AWQ model require more than 12 GB of VRAM to run?

slobodaapl commented 10 months ago

I can confirm this is an issue; I tried on an A100 with plain vLLM (no fork) and am facing the same issue.

thr3a commented 10 months ago

Same issue

manishiitg commented 10 months ago

Try adding --gpu-memory-utilization 0.8; this worked for me.
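
As far as I understand, vLLM pre-allocates a fraction of total GPU memory (0.9 by default, controlled by --gpu-memory-utilization) for the weights plus the KV cache, so on a 12 GB card that already has ~1.5 GB taken by the desktop, the default target can exceed what is actually free. Combined with the flags used earlier in the thread, the invocation would look roughly like this (a sketch, values only illustrative):

python -m vllm.entrypoints.openai.api_server \
        --model TheBloke/CodeLlama-7B-AWQ \
        --quantization awq --dtype half \
        --max-model-len 512 \
        --gpu-memory-utilization 0.8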

slobodaapl commented 10 months ago

I think I found a potential issue and solution. This is specifically because of how vLLM works.

Setting 'max_batch_tokens' (I think that's the name) too high causes the KV cache to be too big; for some reason it directly influences the GPU memory occupied. Try setting your max_batch_tokens to something like 32k while keeping everything else the same.

This fixed it for me.

SupreethRao99 commented 10 months ago

@slobodaapl, could you be more exact about which parameter should be changed? I can't find anything similar to max_batch_tokens in the arguments of SamplingParams or the LLM class.

Thank you!

slobodaapl commented 10 months ago

@SupreethRao99 Sorry, I had to find it. This is how I start my vLLM OpenAI server:

python -m vllm.entrypoints.openai.api_server \
        --served-model $MODEL_ID \
        --model $MODEL_ID \
        --tensor-parallel-size 4 \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768

When I reduced max-num-batched-tokens down to 32768 from the higher number I had previously, I no longer experienced CUDA memory errors. Try setting yours low as well and see if it helps.
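
For the offline LLM class from earlier in the thread, these are engine arguments rather than sampling parameters, and as far as I know they can be passed as keyword arguments to the constructor. A rough sketch (keyword names assumed to mirror the CLI flags):

from vllm import LLM

# engine-level limits, not SamplingParams
llm = LLM(
    model="zephyr-7b-beta-awq",      # local AWQ checkpoint as in the snippet above
    quantization="awq",
    dtype="float16",
    gpu_memory_utilization=0.8,      # cap the fraction of VRAM vLLM pre-allocates
    max_num_batched_tokens=4096,     # CLI: --max-num-batched-tokens
    max_model_len=512,               # CLI: --max-model-len
)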

bonuschild commented 10 months ago

I used the command as you provided, but it still costs 21 GB of VRAM when loading a 7B AWQ model :(

python -m vllm.entrypoints.openai.api_server \
        --model $MODEL_ID \
        --host 0.0.0.0 \
        --port 8080 \
        --max-num-batched-tokens 32768 \
        -q awq --dtype half --trust-remote-code

slobodaapl commented 10 months ago

For those with very limited VRAM, try setting the batched tokens to about 4-8k and combine it with the GPU memory utilization parameter set to about 0.8.

Also, try the non-quantised version with this first. It seems vLLM uses extra memory to do some kind of operation on the model when loading it quantised.

gesanqiu commented 10 months ago

Besides --max-num-batched-tokens and --gpu-memory-utilization, I also limit --max-num-seqs.
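
Putting the three limits together would look roughly like this (a sketch; the exact values are only illustrative and need tuning per GPU):

python -m vllm.entrypoints.openai.api_server \
        --model path/to/CodeLlama-7B-AWQ \
        -q awq --dtype half \
        --max-model-len 512 \
        --max-num-batched-tokens 4096 \
        --max-num-seqs 16 \
        --gpu-memory-utilization 0.8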

demegire commented 10 months ago

For anybody stumbling here, be sure to check max_seq_len; for some reason the default was 32768 in TheBloke/zephyr-7B-beta-AWQ.

bonuschild commented 10 months ago

@demegire Agreed, and you need to find out the correct max sequence length, which is normally not mentioned in the model card :)
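
In my experience the value usually comes from max_position_embeddings in the model's config.json, so it can be checked before serving, e.g. (the path is just an example):

python -c "import json; print(json.load(open('CodeLlama-7B-AWQ/config.json'))['max_position_embeddings'])"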

Jaykumaran commented 8 months ago

!python -u -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
        --dtype half \
        --max-num-batched-tokens 4096 \
        --max-model-len 256 \
        --quantization awq \
        --tensor-parallel-size 1 \
        --port 8010 | grep -q "Uvicorn running" & npx localtunnel --port 8010

This worked on a T4 in Colab with CUDA 12.2.

Jaykumaran commented 8 months ago

Does anyone have an idea about running Ragas evaluation against a vLLM server for Hugging Face models? I was successfully serving my model, but the Ragas metrics evaluation doesn't recognise the vLLM serving and always asks for an OpenAI API key.

from ragas import evaluate
from ragas.metrics import faithfulness

result = evaluate(
    fiqa_eval["baseline"].select(range(5)),  # showing only 5 for demonstration
    metrics=[faithfulness],
)

result

OpenAIKeyNotFound: OpenAI API key not found! Seems like your trying to use Ragas metrics with OpenAI endpoints. Please set 'OPENAI_API_KEY' environment variable
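
One thing I still want to try (not verified against Ragas): since vLLM exposes an OpenAI-compatible endpoint, pointing the OpenAI client at it via environment variables before running the evaluation, e.g.:

export OPENAI_API_KEY="EMPTY"                      # dummy key, vLLM does not check it
export OPENAI_API_BASE="http://localhost:8010/v1"  # vLLM's OpenAI-compatible endpoint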