GPU 0 has a total capacity of 4.00 GiB
Your GPU is too small to hold a 7B model I think, even with int4 quantization.
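(Rough arithmetic: ~7B parameters at 4 bits is already about 3.5 GB for the quantized weights alone, before un-quantized layers, quantization metadata, KV cache, and activations; the logs further down show the weights taking 5.2 GB in practice.)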
--cpu_offload_gb 10
The CPU offloading amount cannot exceed the model weight size. You might try a value like 3.
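For example, applied to the model from the original report (just a sketch; 3 GB is a starting value to tune from, not a verified setting):
vllm serve Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --cpu_offload_gb 3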
NotImplementedError: Cannot copy out of meta tensor; no data!
cc @mgoin, is this because the GPTQ int4 implementation leaves some meta tensors after weight loading?
Hmm, GPTQ shouldn't have meta tensors, and this should have been resolved by https://github.com/vllm-project/vllm/pull/7225 AFAIK. I will try to reproduce this now.
Oh, then it might be because @BooleanMind is using 0.5.4, and #7225 has not been released yet.
I confirmed this issue doesn't occur on main on A100, so please wait for the upcoming 0.5.5 release this week.
vllm serve Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --cpu_offload_gb 2 --quantization "gptq"
...
INFO 08-19 17:24:14 model_runner.py:900] Loading model weights took 3.1711 GB
FYI @BooleanMind: when loading the model without CPU offloading, I see it uses at least 5.2 GB for model weights, so offloading will be needed in your case.
vllm serve Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.99 --quantization "gptq"
...
INFO 08-19 17:25:34 model_runner.py:900] Loading model weights took 5.2035 GB
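Rough budget from the numbers above: 5.2 GB of weights minus 2 GB offloaded leaves about 3.2 GB on the GPU (matching the 3.1711 GB reported), so well under 1 GB of the 4 GiB card remains for KV cache, activations, and CUDA graphs.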
I recently updated to the latest vLLM release (v0.5.5) and was able to get past the meta tensor error that was present in previous versions. Great job on that fix!
However, I am still encountering an Out of Memory (OOM) error when trying to load and run small LLMs. I understand that running LLMs on a GPU with limited VRAM (4 GB) is quite challenging, and my setup is far from ideal. Nevertheless, I've been experimenting with smaller models and seeing incremental improvements in performance, so I am eager to continue this work on my laptop, especially when I am traveling.
Given that laptops typically come with limited VRAM, especially on more affordable models, it would be immensely helpful if vLLM could provide more robust support for low-memory environments. Any guidance or potential fixes you could offer would be greatly appreciated.
Thank you for your continued efforts on this project!
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16 --cpu_offload_gb 6
INFO 08-26 14:33:19 api_server.py:440] vLLM API server version 0.5.5
INFO 08-26 14:33:19 api_server.py:441] args: Namespace(model_tag='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=6.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fd6f2453400>)
INFO 08-26 14:33:21 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-26 14:33:21 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/71c3cea0-fc71-4276-accc-06850147ff74 for RPC Path.
INFO 08-26 14:33:21 api_server.py:161] Started engine process with PID 499
INFO 08-26 14:33:26 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-26 14:33:26 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-26 14:33:27 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 08-26 14:33:29 model_runner.py:879] Starting to load model neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16...
INFO 08-26 14:33:31 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 08-26 14:33:31 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.65s/it]
INFO 08-26 14:33:37 model_runner.py:890] Loading model weights took 1.9634 GB
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
engine = cls(
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
return engine_class(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
super().__init__(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 284, in __init__
self._initialize_kv_caches()
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 390, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 113, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 222, in determine_num_available_blocks
self.model_runner.profile_run()
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1097, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1455, in execute_model
output: SamplerOutput = self.model.sample(
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 447, in sample
next_tokens = self.sampler(logits, sampling_metadata)
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 140, in forward
logits = _apply_top_k_top_p(logits, sampling_tensors.top_ps,
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 290, in _apply_top_k_top_p
logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 376.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 2.83 GiB is allocated by PyTorch, and 424.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 08-26 14:33:51 api_server.py:171] RPCServer process died before responding to readiness probe
Loading model weights took 1.9634 GB
Your GPU only has 4 GB of memory, and even after you offload 6 GB, it still needs about 2 GB to hold the weights.
The remaining ~2 GB is under very high memory pressure, which can lead to OOM.
You should explore various ways to reduce the memory footprint, including but not limited to:
--enforce-eager
to reduce CUDA graph memory usage.
OK. For the benefit of other users, here is the command that finally works for me:
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16 --cpu_offload_gb 3 --max-model-len 4096 --gpu_memory_utilization 1.00
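For other readers on similarly small GPUs, here is a sketch that combines the suggestions from this thread with the allocator setting the OOM message itself recommends; the exact values are illustrative and were not verified on this hardware:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16 --cpu_offload_gb 3 --max-model-len 4096 --gpu_memory_utilization 1.00 --enforce-eager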
Environment:
vLLM version: 0.5.4
Problem:
When running Qwen2 in WSL using vLLM, I encounter a CUDA Out Of Memory (OOM) error.
Commands and errors:
cmd1:
I suspected that my RTX 3050 with 4 GB of VRAM was not sufficient to run Qwen2, so I tried offloading to the CPU with cmd2:
cmd2:
Additional context:
This issue is not limited to Qwen2, as I also encounter a similar OOM error when running neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16.
I would appreciate any help in resolving this issue.