GPU 0 has a total capacity of 4.00 GiB
Your GPU is too small to hold a 7B model I think, even with int4 quantization.
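(Rough arithmetic: ~7B parameters at 4 bits is already about 3.5 GB for the quantized weights alone, before un-quantized layers, quantization metadata, KV cache, and activations; the logs further down show the weights taking 5.2 GB in practice.)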
--cpu_offload_gb 10
The CPU offloading amount cannot exceed the model weight size. You might try a value like 3.
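For example, applied to the model from the original report (just a sketch; 3 GB is a starting value to tune from, not a verified setting):
vllm serve Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --cpu_offload_gb 3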
NotImplementedError: Cannot copy out of meta tensor; no data!
cc @mgoin, is this because the GPTQ int4 implementation leaves some meta tensors after weight loading?
Hmm, GPTQ shouldn't have meta tensors, and this should have been resolved by https://github.com/vllm-project/vllm/pull/7225 AFAIK. I will try to reproduce this now.
Oh, then it might be because @BooleanMind is using 0.5.4, and #7225 has not been released yet.
I confirmed this issue doesn't occur on main on A100, so please wait for the upcoming 0.5.5 release this week.
vllm serve Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --cpu_offload_gb 2 --quantization "gptq"
...
INFO 08-19 17:24:14 model_runner.py:900] Loading model weights took 3.1711 GB
FYI @BooleanMind: when loading the model without CPU offloading, I see it uses at least 5.2 GB for model weights, so offloading will be needed in your case.
vllm serve Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --gpu-memory-utilization 0.99 --quantization "gptq"
...
INFO 08-19 17:25:34 model_runner.py:900] Loading model weights took 5.2035 GB
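Rough budget from the numbers above: 5.2 GB of weights minus 2 GB offloaded leaves about 3.2 GB on the GPU (matching the 3.1711 GB reported), so well under 1 GB of the 4 GiB card remains for KV cache, activations, and CUDA graphs.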
I recently updated to the latest vLLM release (v0.5.5) and was able to get past the meta tensor error that was present in previous versions. Great job on that fix!
However, I am still encountering an Out of Memory (OOM) error when trying to load and run small LLMs. I understand that running LLMs on a GPU with limited VRAM (4 GB) is quite challenging, and my setup is far from ideal. Nevertheless, I've been experimenting with smaller models and seeing incremental improvements in performance, so I am eager to continue this work on my laptop, especially when I am traveling.
Given that laptops typically come with limited VRAM, especially on more affordable models, it would be immensely helpful if vLLM could provide more robust support for low-memory environments. Any guidance or potential fixes you could offer would be greatly appreciated.
Thank you for your continued efforts on this project!
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16 --cpu_offload_gb 6
INFO 08-26 14:33:19 api_server.py:440] vLLM API server version 0.5.5
INFO 08-26 14:33:19 api_server.py:441] args: Namespace(model_tag='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=6.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7fd6f2453400>)
INFO 08-26 14:33:21 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-26 14:33:21 api_server.py:144] Multiprocessing frontend to use ipc:///tmp/71c3cea0-fc71-4276-accc-06850147ff74 for RPC Path.
INFO 08-26 14:33:21 api_server.py:161] Started engine process with PID 499
INFO 08-26 14:33:26 gptq_marlin.py:102] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-26 14:33:26 llm_engine.py:184] Initializing an LLM engine (v0.5.5) with config: model='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', speculative_config=None, tokenizer='neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 08-26 14:33:27 utils.py:721] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 08-26 14:33:29 model_runner.py:879] Starting to load model neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16...
INFO 08-26 14:33:31 weight_utils.py:236] Using model weights format ['*.safetensors']
INFO 08-26 14:33:31 weight_utils.py:280] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:04<00:00, 4.65s/it]
INFO 08-26 14:33:37 model_runner.py:890] Loading model weights took 1.9634 GB
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
engine = cls(
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
return engine_class(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
super().__init__(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 284, in __init__
self._initialize_kv_caches()
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 390, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 113, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 222, in determine_num_available_blocks
self.model_runner.profile_run()
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1097, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1455, in execute_model
output: SamplerOutput = self.model.sample(
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 447, in sample
next_tokens = self.sampler(logits, sampling_metadata)
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 140, in forward
logits = _apply_top_k_top_p(logits, sampling_tensors.top_ps,
File "/home/ubuntu/.local/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 290, in _apply_top_k_top_p
logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 376.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 2.83 GiB is allocated by PyTorch, and 424.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 08-26 14:33:51 api_server.py:171] RPCServer process died before responding to readiness probe
Loading model weights took 1.9634 GB
Your GPU only has 4 GB of memory, and even after you offload 6 GB, it still needs about 2 GB to hold the weights.
The remaining ~2 GB is under very high memory pressure, which can lead to OOM.
You should explore various ways to reduce the memory footprint, including but not limited to:
--enforce-eager
to reduce CUDA graph memory usage.
OK. For the benefit of other users, here is the command that finally works for me:
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16 --cpu_offload_gb 3 --max-model-len 4096 --gpu_memory_utilization 1.00
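For other readers on similarly small GPUs, here is a sketch that combines the suggestions from this thread with the allocator setting the OOM message itself recommends; the exact values are illustrative and were not verified on this hardware:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16 --cpu_offload_gb 3 --max-model-len 4096 --gpu_memory_utilization 1.00 --enforce-eager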
Environment:
vLLM version: 0.5.4
Problem:
When running Qwen2 in WSL using vLLM, I encounter a CUDA Out Of Memory (OOM) error.
Commands and errors:
cmd1:
I suspected that my RTX 3050 with 4 GB of VRAM was not sufficient to run Qwen2, so I tried offloading to the CPU with cmd2:
cmd2:
Additional context:
This issue is not limited to Qwen2, as I also encounter a similar OOM error when running neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w4a16.
I would appreciate any help in resolving this issue.