vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Mixtral AWQ uses massive amount of memory when using its long context, GPU OOM for 2*A100 80GB while normal Mixtral has no issues. #2631

Closed · pseudotensor closed this 6 months ago

pseudotensor commented 9 months ago

vllm 0.2.7 with cuda 12.1.

python -m vllm.entrypoints.openai.api_server \
    --port=5002 \
    --host=0.0.0.0 \
    --model=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ \
    --seed 1234 \
    --trust-remote-code \
    --quantization awq \
    --tensor-parallel-size=2

Running on 2×A100 80GB, but with a longer context, e.g. filling ~31k tokens (leaving 1k for output), it hits a GPU OOM. nvidia-smi shows about 76GB in use per GPU up to that point.

This makes AWQ in vLLM effectively unusable, since the normal non-AWQ Mixtral works perfectly fine under the exact same usage pattern.

awq.txt

I understand that a large context dominates GPU memory usage, but it shouldn't need more memory than the 16-bit model.
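
For reference, a rough back-of-the-envelope for the KV cache alone, assuming Mixtral's published config (32 layers, 8 KV heads, head dim 128) and an fp16 cache:

# 2 (K and V) x 32 layers x 8 KV heads x 128 head_dim x 2 bytes = 128 KiB per token
echo $(( 2 * 32 * 8 * 128 * 2 * 32768 / 1024 / 1024 / 1024 )) "GiB of KV cache for a 32k-token sequence"
# prints: 4 GiB of KV cache for a 32k-token sequence

So the cache for one long sequence is on the order of a few GiB, which doesn't explain why the AWQ model would need more total memory than the fp16 one at the same context length.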

pseudotensor commented 9 months ago

Hi @casper-hansen Any idea here? Thanks!

casper-hansen commented 9 months ago

I’m not sure if this is a tensor parallel bug or just worse performance for large sequences. Either way, you should test out the main branch as I just got a PR merged that uses a different strategy for prefilling.
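
For anyone following along, one way to try the then-current main branch ahead of a release (illustrative only; building vLLM from source needs a working CUDA toolchain and can take a while):

# Install vLLM directly from the GitHub main branch
pip install git+https://github.com/vllm-project/vllm.git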

pseudotensor commented 9 months ago

Will do, thanks. I understand a long sequence can use more memory, but the FP16 Mixtral has no failures and uses much less memory for the same sequences than the AWQ one. I'd expect AWQ to only improve memory usage, not make it worse.

umarbutler commented 9 months ago

@pseudotensor Have you tried --enforce-eager?
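
For reference, --enforce-eager disables CUDA graph capture, trading some latency for lower GPU memory use. A minimal sketch of the original launch command with the flag added:

# Same launch command as above, with CUDA graphs disabled to save memory
python -m vllm.entrypoints.openai.api_server \
    --port=5002 \
    --host=0.0.0.0 \
    --model=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ \
    --seed 1234 \
    --trust-remote-code \
    --quantization awq \
    --tensor-parallel-size=2 \
    --enforce-eager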

rahuja23 commented 9 months ago

Any idea why this is happening? I tried deploying mistralai/Mixtral-8x7B-v0.1 on an OpenShift cluster node with 2 A100 80GB GPUs and I get the following error:

2024-02-07 10:10:59 | ERROR | stderr | Traceback (most recent call last):
2024-02-07 10:10:59 | ERROR | stderr |   File "/app/vllm_api.py", line 54, in <module>
2024-02-07 10:10:59 | ERROR | stderr |     worker, engine = create_vllm_worker(app_config)
2024-02-07 10:10:59 | ERROR | stderr |   File "/app/src/vllm_worker.py", line 190, in create_vllm_worker
2024-02-07 10:10:59 | ERROR | stderr |     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-02-07 10:10:59 | ERROR | stderr |     engine = cls(parallel_config.worker_use_ray,
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-02-07 10:10:59 | ERROR | stderr |     self.engine = self._init_engine(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-02-07 10:10:59 | ERROR | stderr |     return engine_class(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 108, in __init__
2024-02-07 10:10:59 | ERROR | stderr |     self._init_workers_ray(placement_group)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
2024-02-07 10:10:59 | ERROR | stderr |     self._run_workers(
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 755, in _run_workers
2024-02-07 10:10:59 | ERROR | stderr |     self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
2024-02-07 10:10:59 | ERROR | stderr |     all_outputs = ray.get(all_outputs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
2024-02-07 10:10:59 | ERROR | stderr |     return fn(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
2024-02-07 10:10:59 | ERROR | stderr |     return func(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2624, in get
2024-02-07 10:10:59 | ERROR | stderr |     raise value.as_instanceof_cause()
2024-02-07 10:10:59 | ERROR | stderr | ray.exceptions.RayTaskError(OSError): ray::RayWorkerVllm.execute_method() (pid=793, ip=10.131.2.143, actor_id=2104b320853cd220669ab4da01000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7fd83257e310>)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/engine/ray_utils.py", line 31, in execute_method
2024-02-07 10:10:59 | ERROR | stderr |     return executor(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/worker/worker.py", line 79, in load_model
2024-02-07 10:10:59 | ERROR | stderr |     self.model_runner.load_model()
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/worker/model_runner.py", line 57, in load_model
2024-02-07 10:10:59 | ERROR | stderr |     self.model = get_model(self.model_config)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/model_loader.py", line 72, in get_model
2024-02-07 10:10:59 | ERROR | stderr |     model.load_weights(model_config.model, model_config.download_dir,
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/models/mixtral.py", line 407, in load_weights
2024-02-07 10:10:59 | ERROR | stderr |     for name, loaded_weight in hf_model_weights_iterator(
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/weight_utils.py", line 198, in hf_model_weights_iterator
2024-02-07 10:10:59 | ERROR | stderr |     hf_folder, hf_weights_files, use_safetensors = prepare_hf_model_weights(
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/weight_utils.py", line 155, in prepare_hf_model_weights
2024-02-07 10:10:59 | ERROR | stderr |     hf_folder = snapshot_download(model_name_or_path,
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-02-07 10:10:59 | ERROR | stderr |     return fn(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/_snapshot_download.py", line 308, in snapshot_download
2024-02-07 10:10:59 | ERROR | stderr |     thread_map(
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
2024-02-07 10:10:59 | ERROR | stderr |     return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
2024-02-07 10:10:59 | ERROR | stderr |     return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/tqdm/std.py", line 1170, in __iter__
2024-02-07 10:10:59 | ERROR | stderr |     for obj in iterable:
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
2024-02-07 10:10:59 | ERROR | stderr |     yield fs.pop().result()
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/lib/python3.9/concurrent/futures/_base.py", line 445, in result
2024-02-07 10:10:59 | ERROR | stderr |     return self.__get_result()
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
2024-02-07 10:10:59 | ERROR | stderr |     raise self._exception
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
2024-02-07 10:10:59 | ERROR | stderr |     result = self.fn(*self.args, **self.kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/_snapshot_download.py", line 283, in _inner_hf_hub_download
2024-02-07 10:10:59 | ERROR | stderr |     return hf_hub_download(
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
2024-02-07 10:10:59 | ERROR | stderr |     return fn(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py", line 1457, in hf_hub_download
2024-02-07 10:10:59 | ERROR | stderr |     http_get(
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py", line 527, in http_get
2024-02-07 10:10:59 | ERROR | stderr |     temp_file.write(chunk)
2024-02-07 10:10:59 | ERROR | stderr |   File "/usr/lib/python3.9/tempfile.py", line 613, in func_wrapper
2024-02-07 10:10:59 | ERROR | stderr |     return func(*args, **kwargs)
2024-02-07 10:10:59 | ERROR | stderr | OSError: [Errno 28] No space left on device
2024-02-07 10:10:59 | ERROR | stderr | model-00018-of-00019.safetensors:   7%|▋         | 336M/4.98G [00:04<01:08, 68.3MB/s]
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794) /usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py:983: UserWarning: Not enough free disk space to download the file. The expected file size is: 4899.04 MB. The target location /data/hub only has 0.00 MB free disk space.
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794)   warnings.warn(
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794) /usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py:983: UserWarning: Not enough free disk space to download the file. The expected file size is: 4983.00 MB. The target location /data/hub only has 0.00 MB free disk space.
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794)   warnings.warn(
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794) /usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py:983: UserWarning: Not enough free disk space to download the file. The expected file size is: 4899.04 MB. The target location /data/hub/models--mistralai--Mixtral-8x7B-v0.1/blobs only has 0.00 MB free disk space.
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794)   warnings.warn(
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794) /usr/local/lib/python3.9/dist-packages/huggingface_hub/file_download.py:983: UserWarning: Not enough free disk space to download the file. The expected file size is: 4983.00 MB. The target location /data/hub/models--mistralai--Mixtral-8x7B-v0.1/blobs only has 0.00 MB free disk space.
2024-02-07 10:10:59 | ERROR | stderr | (RayWorkerVllm pid=794)   warnings.warn(
2024-02-07 10:11:01 | ERROR | stderr | --- Logging error ---
2024-02-07 10:11:01 | ERROR | stderr | Traceback (most recent call last):
2024-02-07 10:11:01 | ERROR | stderr |   File "/usr/lib/python3.9/logging/__init__.py", line 1087, in emit
2024-02-07 10:11:01 | ERROR | stderr |     self.flush()
2024-02-07 10:11:01 | ERROR | stderr |   File "/usr/lib/python3.9/logging/__init__.py", line 1067, in flush
2024-02-07 10:11:01 | ERROR | stderr |     self.stream.flush()
2024-02-07 10:11:01 | ERROR | stderr | OSError: [Errno 28] No space left on device
2024-02-07 10:11:01 | ERROR | stderr | Call stack:
2024-02-07 10:11:01 | ERROR | stderr |   File "/usr/lib/python3.9/logging/__init__.py", line 2141, in shutdown
2024-02-07 10:11:01 | ERROR | stderr |     h.flush()
2024-02-07 10:11:01 | ERROR | stderr |   File "/usr/lib/python3.9/logging/__init__.py", line 1067, in flush
2024-02-07 10:11:01 | ERROR | stderr |     self.stream.flush()
2024-02-07 10:11:01 | ERROR | stderr |   File "/app/src/utils.py", line 108, in flush
2024-02-07 10:11:01 | ERROR | stderr |     self.logger.log(self.log_level, encoded_message.rstrip())
2024-02-07 10:11:01 | ERROR | stderr | Message: '\x1b[36m(RayWorkerVllm pid=793)\x1b[0m'
2024-02-07 10:11:01 | ERROR | stderr | Arguments: ()
2024-02-07 10:11:01 | ERROR | stderr | (RayWorkerVllm pid=793)
hmellor commented 9 months ago

@rahuja23 your drive containing /data is full. Does your node have enough storage?
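
For anyone hitting the same OSError, a quick check and a common workaround (the /data/hub path comes from the log above; the download directory below is a placeholder for any volume with enough room):

# Check free space on the volume backing the Hugging Face cache
df -h /data/hub

# Point vLLM's weight download at a larger volume via --download-dir
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-v0.1 \
    --tensor-parallel-size 2 \
    --download-dir /path/with/enough/space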

rahuja23 commented 9 months ago

Yeah it seems like the PVC attached was full. Seems to be working now! Thanks 😄

hmellor commented 7 months ago

@pseudotensor can this be closed now?

pseudotensor commented 7 months ago

Hi, I haven't tried AWQ for Mixtral on vLLM lately. Is there reason to suspect the issue is resolved?

hmellor commented 7 months ago

I have been able to successfully deploy it myself on a single A100 40GB (with some careful tweaking of engine settings) using https://huggingface.co/casperhansen/mixtral-instruct-awq, so I believe it could be resolved.
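
The exact settings aren't spelled out here; the sketch below only illustrates the usual knobs for fitting an AWQ Mixtral into less VRAM (a shorter max context, higher memory utilization, and no CUDA graphs):

# Illustrative settings only, not the exact configuration used above
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/mixtral-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager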

I wanted to check if you were still experiencing issues.