vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments #5983

Closed lonngxiang closed 4 months ago

lonngxiang commented 4 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Running LLaVA-NeXT with the following command raises an error:

python -m vllm.entrypoints.openai.api_server --model /ai/LLaVA-NeXT --image-token-id 32000 --image-input-shape 1,3,336,336 --image-input-type pixel_values --image-feature-size 65856 --chat-template template_llava.jinja --host 19*** --port 10860 --trust-remote-code --tensor-parallel-size 2 --dtype=half --disable-custom-all-reduce


DarkLight1337 commented 4 months ago

Please provide more information on your environment by running the command at the beginning of your post (under "Your current environment")

lonngxiang commented 4 months ago

/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
INFO 06-29 02:39:20 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-29 02:39:20 api_server.py:178] args: Namespace(host='192.168.2.238', port=10860, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/ai/LLaVA-NeXT', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=65856, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-06-29 02:39:23,558 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-29 02:39:24 config.py:623] Defaulting to use mp for distributed inference
INFO 06-29 02:39:24 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/ai/LLaVA-NeXT', speculative_config=None, tokenizer='/ai/LLaVA-NeXT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/ai/LLaVA-NeXT)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:36 model_runner.py:160] Loading model weights took 7.3588 GB
INFO 06-29 02:39:37 model_runner.py:160] Loading model weights took 7.3588 GB
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph', Traceback (most recent call last):
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     self.execute_model(seqs, kv_caches)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 735, in execute_model
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     ) = self.prepare_input_tensors(seq_group_metadata_list)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 712, in prepare_input_tensors
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     attn_metadata = self.attn_backend.make_metadata(
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]     return FlashAttentionMetadata(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph'
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]
rank0: Traceback (most recent call last):
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rank0:     return _run_code(code, main_globals, None,
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
rank0:     exec(code, run_globals)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
rank0:     engine = AsyncLLMEngine.from_engine_args(
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
rank0:     engine = cls(
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
rank0:     self.engine = self._init_engine(*args, **kwargs)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
rank0:     return engine_class(*args, **kwargs)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
rank0:     num_blocks = self._run_workers("determine_num_available_blocks", )
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
rank0:     driver_worker_output = driver_worker_method(*args, **kwargs)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
rank0:     self.execute_model(seqs, kv_caches)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0:     return func(*args, **kwargs)
rank0:   File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 741, in execute_model
rank0:     prefill_meta = attn_metadata.prefill_metadata
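The worker traceback ends in a constructor call that received fewer fields than the class requires. The sketch below reproduces that kind of `TypeError` in isolation. `ToyAttentionMetadata`, its three fields, and `try_build` are hypothetical stand-ins chosen for illustration; this is not vLLM's `FlashAttentionMetadata` and says nothing about why vLLM passed too few arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyAttentionMetadata:
    """Hypothetical stand-in with required fields; NOT vLLM's class."""
    num_prefills: int
    seq_lens: List[int]
    block_tables: List[List[int]]


def make_metadata(*args, **kwargs) -> ToyAttentionMetadata:
    # Forwards whatever it receives to the constructor, like the
    # make_metadata frame shown in the traceback.
    return ToyAttentionMetadata(*args, **kwargs)


def try_build(**kwargs) -> str:
    """Return 'ok', or the TypeError message raised by the constructor."""
    try:
        make_metadata(**kwargs)
        return "ok"
    except TypeError as exc:
        return str(exc)


# Only one of the three required fields is supplied, so construction fails
# with a "missing 2 required positional arguments" message.
message = try_build(num_prefills=0)
print(message)
```

Supplying all three fields makes `try_build` return `"ok"`, which is the general shape of the fix: whatever builds the keyword arguments has to populate every required field.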

DarkLight1337 commented 4 months ago

This doesn't look like the output of `python collect_env.py`.

lonngxiang commented 4 months ago

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.17

Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.118.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
Stepping: 6
CPU MHz: 2099.998
BogoMIPS: 4199.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.0
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.16.2+cu118
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] sentence-transformers 2.7.0 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0    X       PHB     0-7             0                N/A
GPU1    PHB     X       0-7             0                N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

DarkLight1337 commented 4 months ago

There may be some mismatched Python packages. Try reinstalling your Python environment.
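One way to look for such mismatches is to compare the CUDA build tags in the version strings reported above. This is a minimal sketch: `cuda_tag` and `find_cuda_mismatches` are hypothetical helper names, and the check only covers the `+cuXXX` suffix, not every form of incompatibility between packages.

```python
import re
from typing import Dict, List, Optional


def cuda_tag(version: str) -> Optional[str]:
    """Return the 'cuXXX' local tag of a version string, or None if absent."""
    match = re.search(r"\+(cu\d+)", version)
    return match.group(1) if match else None


def find_cuda_mismatches(versions: Dict[str, str], reference: str = "torch") -> List[str]:
    """List packages whose CUDA tag differs from the reference package's tag."""
    ref_tag = cuda_tag(versions[reference])
    if ref_tag is None:
        return []  # nothing to compare against
    mismatched = []
    for name, version in versions.items():
        tag = cuda_tag(version)
        if name != reference and tag is not None and tag != ref_tag:
            mismatched.append(f"{name} {version} (expected {ref_tag})")
    return mismatched


# Version strings copied from the collect_env.py output in this thread.
reported = {
    "torch": "2.3.0+cu121",
    "torchaudio": "2.1.2+cu118",
    "torchvision": "0.16.2+cu118",
}
for line in find_cuda_mismatches(reported):
    print(line)
```

With the versions from this thread, both torchaudio and torchvision are flagged as built for CUDA 11.8 while torch is built for CUDA 12.1, which is also consistent with the torchvision `undefined symbol` warning at the top of the log.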

DarkLight1337 commented 4 months ago

I ran into a similar issue recently, and it turned out that vLLM could not allocate blocks for the model. Here, I think you set `image_feature_size` to a value that is too high (it should normally be around 2k, not 60k).

Anyway, the `--image-feature-size` argument has since been removed (the value is now computed automatically as of #6089), so you should not run into this issue anymore.
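To give a sense of scale for the feature size, here is a rough back-of-the-envelope estimate. It rests on assumptions that are not stated in this thread: a vision encoder with 14-pixel patches on 336-pixel tiles, and a layout of base-tile features plus high-resolution grid features plus one newline token per grid row. `estimate_feature_size` is a hypothetical helper, not a vLLM function, and the real computation may differ.

```python
def patches_per_side(image_size: int, patch_size: int = 14) -> int:
    """Number of patches along one side of a square tile."""
    return image_size // patch_size


def estimate_feature_size(height: int, width: int,
                          tile_size: int = 336, patch_size: int = 14) -> int:
    """Rough estimate of image tokens for one image.

    Assumes height and width are multiples of tile_size, so no padding
    has to be removed from the high-resolution grid.
    """
    side = patches_per_side(tile_size, patch_size)  # 24 for 336 / 14
    base = side * side                              # downscaled overview tile
    grid_h = (height // tile_size) * side
    grid_w = (width // tile_size) * side
    # One extra "newline" token per grid row (assumed layout).
    return base + grid_h * grid_w + grid_h


print(estimate_feature_size(336, 336))  # 576 + 576 + 24 = 1176
print(estimate_feature_size(672, 672))  # 576 + 2304 + 48 = 2928
```

Under these assumptions the estimate stays in the low thousands, the same order of magnitude as the "around 2k" mentioned above and far below the 65856 passed on the command line.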