vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: --cpu-offload-gb flag not honored in vllm/vllm-openai container on amazon g5.2xlarge #9367

Open DanielViglione opened 1 month ago

DanielViglione commented 1 month ago

Your current environment

Collecting environment information...
/opt/conda/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:16:49) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1070-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 2799.998
BogoMIPS: 5599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 2 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.45.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     0-7           0              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

Model Input Dumps

No response

🐛 Describe the bug

1) Launch an EC2 instance with the following AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Ubuntu 20.04) 20240929

2) Set the instance type to g5.2xlarge.

These are the GPU's specs: the NVIDIA A10G Tensor Core GPU is built on the NVIDIA Ampere architecture and offers more capacity than the Turing-generation Tesla T4. The A10G has 6928 CUDA cores, 288 Tensor Cores, 24 GB of GDDR6 VRAM (GPU memory), a 384-bit memory bus, 600.2 GB/s of memory bandwidth, a 1695 MHz boost clock, an 885 MHz base clock, and 28.3 billion transistors.

3) Run the vllm/vllm-openai image as a container:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=hf_token." \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --trust-remote-code \
  --max-model-len 30000 \
  --gpu-memory-utilization 0.90 \
  --dtype float16 \
  --cpu-offload-gb 4
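For reference, the same settings can be exercised through vLLM's offline Python API. This is a minimal sketch, not part of the original report, assuming the cpu_offload_gb engine argument mirrors the --cpu-offload-gb CLI flag (i.e. up to N GiB of model weights are kept in CPU RAM instead of GPU memory):

# Sketch only -- assumes vLLM's offline Python API, where cpu_offload_gb
# mirrors the --cpu-offload-gb CLI flag used in the docker command above.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    trust_remote_code=True,
    max_model_len=30000,
    gpu_memory_utilization=0.90,
    dtype="float16",
    cpu_offload_gb=4,  # same value as the flag in the docker command above
)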

And the output of the docker run:

/usr/local/lib/python3.12/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 10-15 01:20:29 api_server.py:528] vLLM API server version dev
INFO 10-15 01:20:29 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='float16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=4.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-15 01:20:29 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/5e2be591-e10f-4b91-bf02-c7e659b861a8 for IPC Path.
INFO 10-15 01:20:29 api_server.py:179] Started engine process with PID 20
WARNING 10-15 01:20:29 config.py:1674] Casting torch.bfloat16 to torch.float16.
/usr/local/lib/python3.12/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash: No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-15 01:20:33 config.py:1674] Casting torch.bfloat16 to torch.float16.
INFO 10-15 01:20:38 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='meta-llama/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=meta-llama/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-15 01:20:38 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-15 01:20:38 selector.py:115] Using XFormers backend.
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-15 01:20:39 model_runner.py:1060] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
INFO 10-15 01:20:39 selector.py:115] Using XFormers backend.
INFO 10-15 01:20:39 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:03<00:14, 3.67s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:05<00:08, 2.76s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:06<00:03, 1.93s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:09<00:02, 2.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:11<00:00, 2.23s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:11<00:00, 2.32s/it]

INFO 10-15 01:20:52 model_runner.py:1071] Loading model weights took 19.9073 GB
INFO 10-15 01:20:52 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 392, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 141, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 349, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 359, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/enc_dec_model_runner.py", line 211, in execute_model
    MultiModalInputs.as_kwargs(multi_modal_kwargs,
  File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 112, in as_kwargs
    json_mapped = json_map_leaves(
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 937, in json_map_leaves
    return {k: json_map_leaves(func, v) for k, v in value.items()}
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 943, in json_map_leaves
    return func(value)
  File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 113, in <lambda>
    lambda x: x.to(device, non_blocking=True),
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.59 GiB. GPU 0 has a total capacity of 21.98 GiB of which 1.58 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.94 GiB is allocated by PyTorch, and 164.71 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
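For completeness: the allocator hint printed in the error can be passed through to the container as an environment variable, as in the sketch below. Note this only mitigates fragmentation, so it is unlikely to fix a genuine capacity shortfall like the one here:

docker run --runtime nvidia --gpus all \
  --env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  ... (rest of the command as in step 3 above)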

vLLM does not appear to use CPU memory at all; the --cpu-offload-gb flag is not honored. I get the same error whether or not I pass --cpu-offload-gb:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.59 GiB. GPU 0 has a total capacity of 21.98 GiB of which 1.58 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.94 GiB is allocated by PyTorch, and 164.71 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The error reports a total capacity of only 21.98 GiB of GPU memory. However, the instance comes with 24 GB of GPU memory, and with the additional 4 GiB of CPU offload the effective budget should be around 28 GB, which is within range of what vLLM is trying to use.
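To spell out that arithmetic, here is a rough back-of-the-envelope check using only the figures from the error message:

# Back-of-the-envelope check using only the numbers the error reports.
weights_gib = 19.94        # model weights already resident on the GPU
profile_alloc_gib = 3.59   # allocation attempted during the profile run
capacity_gib = 21.98       # usable A10G capacity reported by PyTorch

needed = weights_gib + profile_alloc_gib   # 23.53 GiB demanded on-GPU
budget = 0.90 * capacity_gib               # ~19.78 GiB allowed by --gpu-memory-utilization 0.90
print(f"needed {needed:.2f} GiB vs capacity {capacity_gib:.2f} GiB (budget {budget:.2f} GiB)")
# 23.53 GiB > 21.98 GiB: the profile run can only fit if roughly 4 GiB of
# weights actually move to CPU RAM -- exactly what --cpu-offload-gb 4
# was expected to provide.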


simon-mo commented 1 month ago

11B models require a lot more memory, I believe; in particular, the KV cache for cross-attention is fairly large.
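For a rough sense of scale, here is a sketch of a per-sequence cross-attention KV-cache estimate. Every architecture number in it is an illustrative assumption, not a value read from the Llama-3.2-11B-Vision config:

# Rough per-sequence size of a cross-attention KV cache.
# All architecture numbers below are assumptions for illustration only.
num_cross_layers = 8   # assumed cross-attention layer count
num_kv_heads = 8       # assumed KV heads
head_dim = 128         # assumed head dimension
encoder_len = 6404     # assumed vision-encoder sequence length
bytes_per_elem = 2     # float16

# factor 2 = one K tensor plus one V tensor per layer
kv_bytes = 2 * num_cross_layers * num_kv_heads * head_dim * encoder_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.2f} GiB per sequence")  # ~0.20 GiB with these numbers
# vLLM's profile run batches up to max_num_seqs sequences (256 in the args
# dump above), so even a few tenths of a GiB per sequence multiplies quickly.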