Hi, can anyone confirm that they have tried AWQ and maybe which 7B or 13B model worked for them? Can you recall what memory footprint the model loading had?
I get the same issue with TheBloke/CodeLlama-7B-AWQ and TheBloke/WizardCoder-Python-7B-V1.0-AWQ on my 3080 with 10 GB VRAM. I can, however, run TheBloke/UltraLM-13B-v2.0-AWQ without getting an OOM.
Possibly shedding some light: I am able to solve the error within AutoAWQ by setting the fuse_layers parameter to False.
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)
I tested it for both TheBloke/CodeLlama-7B-AWQ and TheBloke/tulu-7B-AWQ. Below is the full example:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
quant_path = "TheBloke/tulu-7B-AWQ"
quant_file = "model.safetensors"
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, safetensors=True, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {prompt}
ASSISTANT:"""
tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"),
    return_tensors='pt'
).input_ids.cuda()
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
vLLM unfortunately does not use AutoAWQForCausalLM under the hood, so this cannot be an immediate fix. The issue seemingly lies within the awq_gemm function located in vllm/csrc/quantization/awq/gemm_kernels.cu.
Great! Thank you. I will try to have a look at it too, although I unfortunately understand very little about CUDA programming. Besides TheBloke/UltraLM-13B-v2.0-AWQ, can you recommend other models that don't seem to have this issue?
There seem to be overlaps with #1236, where some progress has been made by limiting the max context_length. But if I am not mistaken, the high memory footprint of AWQ models is present there too.
By the way, the fused modules should also be fixed for VRAM issues now. I implemented a fix that saves 2 GB of VRAM in AutoAWQ, but this is not related to any memory issues in vLLM.
I have 24 GB of VRAM on an L4 GPU and am still getting the same OOM error.
Yes, I can confirm that with version v0.2.1, trying to load https://huggingface.co/casperhansen/mistral-7b-instruct-v0.1-awq, I am still running OOM with:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 GiB (GPU 0; 11.76 GiB total capacity; 4.87 GiB already allocated; 5.42 GiB free; 5.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
You should limit the maximum sequence length in vLLM to avoid OOM. Please read through the input arguments to the LLM engine to see exactly how this works.
Try max_model_len=512, for instance. You are trying to load 32k tokens into the cache, so you need to specify the input arguments correctly for vLLM.
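For reference, a minimal sketch of passing these arguments through the vLLM Python API; the checkpoint is the one linked above, and the 512/128 token values are only illustrative, not tuned recommendations:
# Minimal sketch: load an AWQ checkpoint with a capped context length so the
# pre-allocated KV cache stays small. All values here are examples.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/mistral-7b-instruct-v0.1-awq",
    quantization="awq",
    dtype="half",
    max_model_len=512,  # cap the context instead of the model's full 32k
)

outputs = llm.generate(
    ["How are you today?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)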
Hello @casper-hansen, you are correct. Setting --max-model-len 512 makes it possible for me to load the mistral-7b-awq model. But just loading the model takes 11 GB of VRAM (11236MiB / 12288MiB). Maybe I am confused, but I thought that with AWQ 4-bit compression it should be much lower.
I’m not quite sure how vLLM allocates memory. In AutoAWQ, we only allocate the cache you ask for and it will definitely not take up 11GB VRAM for 512 tokens.
I run the following command and get 'Out of Memory'. Does anyone know what's going on?
python -m vllm.entrypoints.api_server --model /root/autodl-tmp/Yi-6B-200K-AWQ --quantization awq --trust-remote-code
@exceedzhang try to limit the max sequence length with --max-model-len XXXX
Hello guys,
I was able to load my fine-tuned version of mistral-7b-v0.1-awq (quantized with AutoAWQ) on my 24 GB TITAN RTX, and it is using almost 21 GB of the 24 GB. This is huge, because using transformers with AutoAWQ uses 7 GB of my GPU. Does someone know how to reduce it? Is the "solution" really just to increase --max-model-len?
Notes, setting:
--max-model-len 512 uses 22 GB
--max-model-len 4096 uses 21 GB
--max-model-len 8192 uses 18 GB
$ CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model "./models/mistral-7b-v0.1-awq" --quantization awq --dtype half --max-model-len 4096
INFO 11-16 06:50:58 api_server.py:615] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='./models/mistral-7b-v0.1-awq', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 11-16 06:50:58 llm_engine.py:72] Initializing an LLM engine with config: model='./models/mistral-7b-v0.1-awq', tokenizer='./models/mistral-7b-v0.1-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 11-16 06:51:09 llm_engine.py:207] # GPU blocks: 7724, # CPU blocks: 2048
INFO: Started server process [1027898]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
$ nvidia-smi
Thu Nov 16 06:53:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN V Off| 00000000:09:00.0 Off | N/A |
| 36% 52C P8 28W / 250W| 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN RTX Off| 00000000:42:00.0 Off | N/A |
| 40% 47C P8 21W / 280W| 20899MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 1 N/A N/A 1027898 C python3 20888MiB |
+---------------------------------------------------------------------------------------+
This is a bit strange... I know vLLM (or Ray) reserves GPU memory on warm-up, but why is a larger --max-model-len more memory-efficient? Which parameter is most effective for controlling vLLM's memory usage?
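A possible knob here, assuming standard vLLM behavior: vLLM pre-allocates KV-cache blocks until it reaches the fraction of GPU memory given by gpu_memory_utilization (0.9 by default, which is what the engine args logged above show), so the number reported by nvidia-smi mostly reflects that budget rather than the weights alone. Lowering it, e.g. with --gpu-memory-utilization 0.5 on the command line or as in the sketch below, should shrink the footprint; the 0.5 value is only an example:
# Hedged sketch: gpu_memory_utilization caps how much of the GPU vLLM claims
# (weights plus pre-allocated KV cache). 0.5 is just an example value.
from vllm import LLM

llm = LLM(
    model="./models/mistral-7b-v0.1-awq",  # local path from the log above
    quantization="awq",
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.5,  # default is 0.9
)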
I can reproduce this on my 14 GB Tesla T4.
I'm facing this behavior as well. vLLM occupies 17 GB of VRAM when running the mistral-7b-awq model.
This later causes OOM on long prompts (3k+ tokens) at runtime.
Any idea how to fix this behavior?
I recently tried AWQ-quantized Mistral-7B with TGI, which used around 8 GB to load the model instead of the 11.2 GB with vLLM. I guess there is something strange happening in vLLM with AWQ-quantized models.
I've been having the same issue, and someone on TheBloke's Discord channel said it might be because AWQ uses batched inference and on startup grabs as much memory as is available. I have not had this verified or found a way to limit the amount of memory via parameters to TGI. I could really do with a clear answer on running AWQ on TGI without it grabbing all the memory.
I have not tried it myself, because I am using TGI for now, but I saw in a video that one seems to avoid the issue by adding the option --dtype half.
Maybe that's worth a try.
Any news on this?
As of today, running --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq --max-model-len 256 results in 23146M of usage, whilst using --model mistralai/Mistral-7B-v0.1 --max-model-len 256 uses 22736M, so there seems to be an issue with AWQ, I guess (even though both models may of course differ in memory usage) 🤔
I have included an example of using vLLM in the AutoAWQ documentation. I am not sure why vLLM is as memory hungry as you suggest, but I would try this working example first to figure out if the problem persists or if it has to do with certain arguments being passed to the engine.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hi,
I am trying to use AWQ quantization to load 7B Llama-based models onto my RTX 3060 with 12 GB. This fails with OOM for models like https://huggingface.co/TheBloke/leo-hessianai-7B-AWQ. I was able to load https://huggingface.co/TheBloke/tulu-7B-AWQ, with its 2k sequence length, taking up 11.2 GB of my VRAM.
My expectation was that these 7B models with AWQ (GEMM) quantization would need only around ~3.5 GB to load for inference.
I tried to load the models from within my app, using vLLM as a library and following TheBloke's instructions.
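Roughly, something along these lines (a hypothetical reconstruction; the model name and sampling values are illustrative):
# Hypothetical reconstruction of a TheBloke-style vLLM example, not verbatim;
# the model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/tulu-7B-AWQ", quantization="awq", dtype="half")

outputs = llm.generate(
    ["Tell me about AI"],
    SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256),
)
print(outputs[0].outputs[0].text)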
Am I missing something here?
Thx, Manuel