vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Running out of memory loading 7B AWQ-quantized models with 12 GB VRAM #1234

Open mapa17 opened 1 year ago

mapa17 commented 1 year ago

Hi,

I am trying to use AWQ quantization to load 7B LLaMA-based models onto my RTX 3060 with 12 GB of VRAM. This fails with OOM for models like https://huggingface.co/TheBloke/leo-hessianai-7B-AWQ . I was able to load https://huggingface.co/TheBloke/tulu-7B-AWQ with its 2k sequence length, which took up 11.2 GB of my VRAM.

My expectation was that these 7B models with AWQ (GEMM) quantization would need around ~3.5 GB to load for inference.

I tried to load the models from within my app using vLLM as a library, and also by following TheBloke's instructions with

python -m vllm.entrypoints.api_server --model TheBloke/tulu-7B-AWQ --quantization awq
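From Python, the library path looks roughly like this minimal sketch (simplified; the prompt and sampling values are just placeholders, not my real app):

from vllm import LLM, SamplingParams

# Load the AWQ-quantized model through the vLLM Python API.
llm = LLM(model="TheBloke/tulu-7B-AWQ", quantization="awq", dtype="half")

# Generate a short completion to confirm the model loaded.
outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)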

Am I missing something here?

Thx, Manuel

mapa17 commented 1 year ago

Hi, can anyone confirm that they have tried AWQ, and perhaps which 7B or 13B models worked for them? Can you recall what memory footprint loading the model had?

elslush commented 1 year ago

I get the same issue with TheBloke/CodeLlama-7B-AWQ and TheBloke/WizardCoder-Python-7B-V1.0-AWQ on my 3080 with 10 GB of VRAM. I can, however, run TheBloke/UltraLM-13B-v2.0-AWQ without getting OOM.

elslush commented 1 year ago

Possibly shedding some light: I am able to solve the error within AutoAWQ by setting the fuse_layers parameter to False.

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)

I tested it for both TheBloke/CodeLlama-7B-AWQ and TheBloke/tulu-7B-AWQ. Below is the full example:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/tulu-7B-AWQ"
quant_file = "model.safetensors"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

vLLM unfortunately does not use AutoAWQForCausalLM under the hood, so this cannot be an immediate fix. The issue seemingly lies within the awq_gemm function located in the vllm/csrc/quantization/awq/gemm_kernels.cu file.

mapa17 commented 1 year ago

Great! Thank you. I will try to have a look at it too, although I unfortunately understand very little about CUDA programming. Besides TheBloke/UltraLM-13B-v2.0-AWQ, can you recommend other models that don't seem to have this issue?

mapa17 commented 1 year ago

There seems to be overlap with #1236, and some progress on the topic by limiting the max context_length. But if I am not mistaken, the high memory footprint of AWQ models is present there too.

casper-hansen commented 1 year ago

The fused modules should also be fixed for VRAM issues now, by the way. I implemented a fix that saves 2 GB of VRAM in AutoAWQ, but this is not related to any memory issues in vLLM.

manishiitg commented 1 year ago

I have 24 GB of VRAM on an L4 GPU and am still getting the same OOM error.

mapa17 commented 1 year ago

Yes, I can confirm that with version v0.2.1, trying to load https://huggingface.co/casperhansen/mistral-7b-instruct-v0.1-awq , I am still running OOM with:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 11.76 GiB total capacity; 4.87 GiB already allocated; 5.42 GiB free; 5.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

casper-hansen commented 1 year ago

You should limit the maximum number of tokens in vLLM to avoid OOM. Please read through the input arguments to the LLM engine to find out exactly how it works.

casper-hansen commented 1 year ago

Try max_model_len=512, for instance. You are trying to load a 32k context into the cache, so you need to specify the input arguments correctly for vLLM.
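Something like this minimal sketch (the model name and value are just examples taken from this thread):

from vllm import LLM

# max_model_len caps the maximum sequence length the engine will accept,
# so vLLM does not have to budget KV cache for the model's full 32k context.
llm = LLM(
    model="casperhansen/mistral-7b-instruct-v0.1-awq",
    quantization="awq",
    max_model_len=512,
)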

mapa17 commented 1 year ago

Hello @casper-hansen, you are correct. Setting --max-model-len 512 makes it possible for me to load the mistral-7b-awq model. But just loading the model takes 11 GB of VRAM (11236MiB / 12288MiB). Maybe I am confused, but I thought that with AWQ 4-bit compression it should be much lower.

casper-hansen commented 1 year ago

I’m not quite sure how vLLM allocates memory. In AutoAWQ, we only allocate the cache you ask for and it will definitely not take up 11GB VRAM for 512 tokens.

exceedzhang commented 1 year ago

I ran the following command and got 'Out of Memory'. Does anyone know what's going on?

python -m vllm.entrypoints.api_server --model /root/autodl-tmp/Yi-6B-200K-AWQ --quantization awq --trust-remote-code

mapa17 commented 12 months ago

@exceedzhang try to limit the max sequence length with --max-model-len XXXX
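For example (the value here is only illustrative; pick one that fits your use case):

python -m vllm.entrypoints.api_server --model /root/autodl-tmp/Yi-6B-200K-AWQ --quantization awq --trust-remote-code --max-model-len 4096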

andrewssobral commented 12 months ago

Hello guys,

I was able to load my fine-tuned version of mistral-7b-v0.1-awq quantized with AutoAWQ on my 24 GB TITAN RTX, and it is using almost 21 GB of the 24 GB. This is huge, because using transformers with AutoAWQ uses only 7 GB of my GPU. Does someone know how to reduce it? Is the "solution" done by adjusting --max-model-len?

Notes on my settings:

$ CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model "./models/mistral-7b-v0.1-awq" --quantization awq --dtype half --max-model-len 4096
INFO 11-16 06:50:58 api_server.py:615] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='./models/mistral-7b-v0.1-awq', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 11-16 06:50:58 llm_engine.py:72] Initializing an LLM engine with config: model='./models/mistral-7b-v0.1-awq', tokenizer='./models/mistral-7b-v0.1-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 11-16 06:51:09 llm_engine.py:207] # GPU blocks: 7724, # CPU blocks: 2048
INFO:     Started server process [1027898]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

$ nvidia-smi
Thu Nov 16 06:53:03 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN V                  Off| 00000000:09:00.0 Off |                  N/A |
| 36%   52C    P8               28W / 250W|      0MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX                Off| 00000000:42:00.0 Off |                  N/A |
| 40%   47C    P8               21W / 280W|  20899MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A   1027898      C   python3                                   20888MiB |
+---------------------------------------------------------------------------------------+

s-natsubori commented 12 months ago

This is a bit strange... I know vLLM (or Ray) reserves GPU memory on warm-up, but why would a larger --max-model-len be more efficient? Which parameter is most effective for vLLM?

I can reproduce this on my 14 GB Tesla T4.

matankley commented 11 months ago

I'm facing this behavior as well. vLLM occupies 17 GB of VRAM when running the mistral-7b-awq model.

This later causes OOM on long prompts (3k+ tokens) at runtime.

Any idea how to fix that behavior?

mapa17 commented 11 months ago

I recently tried an AWQ-quantized Mistral-7B with TGI, which used around 8 GB to load the model instead of the 11.2 GB with vLLM. I guess there is something strange happening in vLLM with AWQ-quantized models.

kulbinderdio commented 10 months ago

I've been having the same issue, and someone on TheBloke's Discord channel said it might be because AWQ uses batched inference and on startup grabs as much memory as is available. I have not had this verified or found a way to limit the amount of memory via parameters to TGI. I could really do with a clear answer on running AWQ on TGI without it grabbing all the memory.

mapa17 commented 10 months ago

I have not tried it myself, because I am using TGI for now, but I saw in a video that one seems to avoid the issue by adding the option --dtype half.

Maybe that's worth a try.

qdm12 commented 8 months ago

Any news on this?

As of today, running --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq --max-model-len 256 results in 23146M of usage, whilst using --model mistralai/Mistral-7B-v0.1 --max-model-len 256 uses 22736M, so there seems to be an issue with AWQ, I guess (even though the two models may of course differ in memory usage) 🤔

casper-hansen commented 8 months ago

I have included an example of using vLLM in the AutoAWQ documentation. I am not sure why vLLM is as memory-hungry as you suggest, but I would try this working example first to figure out whether the problem persists or whether it has to do with certain arguments being passed to the engine.

https://casper-hansen.github.io/AutoAWQ/examples/#vllm
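For anyone landing here later, a rough self-contained sketch along those lines (not copied from the linked page; the values are illustrative and need tuning for your GPU):

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",
    quantization="awq",
    dtype="half",
    max_model_len=2048,            # bound the supported sequence length
    gpu_memory_utilization=0.80,   # fraction of VRAM vLLM pre-allocates for weights + KV cache
)

outputs = llm.generate(["What is AWQ quantization?"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)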

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!