Closed: lonngxiang closed this issue 4 months ago
Please provide more information on your environment by running the command at the beginning of your post (under "Your current environment")
/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
INFO 06-29 02:39:20 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-29 02:39:20 api_server.py:178] args: Namespace(host='192.168.2.238', port=10860, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/ai/LLaVA-NeXT', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=65856, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, 
speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-06-29 02:39:23,558 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-29 02:39:24 config.py:623] Defaulting to use mp for distributed inference
INFO 06-29 02:39:24 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/ai/LLaVA-NeXT', speculative_config=None, tokenizer='/ai/LLaVA-NeXT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/ai/LLaVA-NeXT)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:36 model_runner.py:160] Loading model weights took 7.3588 GB
INFO 06-29 02:39:37 model_runner.py:160] Loading model weights took 7.3588 GB
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph', Traceback (most recent call last):
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] self.execute_model(seqs, kv_caches)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 735, in execute_model
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] ) = self.prepare_input_tensors(seq_group_metadata_list)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 712, in prepare_input_tensors
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] attn_metadata = self.attn_backend.make_metadata(
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return FlashAttentionMetadata(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph'
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]
rank0: Traceback (most recent call last):
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rank0: return _run_code(code, main_globals, None,
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
rank0: exec(code, run_globals)
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
rank0: num_blocks = self._run_workers("determine_num_available_blocks", )
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
rank0: driver_worker_output = driver_worker_method(*args, **kwargs)
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
rank0: self.execute_model(seqs, kv_caches)
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 741, in execute_model
rank0: prefill_meta = attn_metadata.prefill_metadata
This doesn't look like the output of python collect_env.py.
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.17
Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.118.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
Stepping: 6
CPU MHz: 2099.998
BogoMIPS: 4199.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.0
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.16.2+cu118
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] sentence-transformers 2.7.0 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-7             0               N/A
GPU1    PHB      X      0-7             0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
There may be some mismatched Python packages. Try reinstalling your Python environment.
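One visible candidate is the torch/torchvision pair in the environment output: torch is 2.3.0, while torchvision is 0.16.2+cu118 and torchaudio is 2.1.2+cu118. A minimal sketch of a check follows. The helper names and the small pairing table (torch 2.1 with torchvision 0.16, 2.2 with 0.17, 2.3 with 0.18) are written by hand for this illustration and are not part of vLLM or PyTorch, so consult the official compatibility matrix for other versions.

```python
# Illustrative table of torch -> torchvision minor-version pairings.
# Hand-written for this example; extend it from the official matrix.
TORCHVISION_FOR_TORCH = {"2.1": "0.16", "2.2": "0.17", "2.3": "0.18"}


def minor(version: str) -> str:
    """Return 'major.minor' from a version such as '2.3.0+cu121'."""
    base = version.split("+", 1)[0]  # drop a local build tag like '+cu118'
    return ".".join(base.split(".")[:2])


def is_matching_pair(torch_version: str, torchvision_version: str) -> bool:
    """True if the pair matches the table; False if it differs or is unknown."""
    expected = TORCHVISION_FOR_TORCH.get(minor(torch_version))
    return expected is not None and expected == minor(torchvision_version)


# Versions reported in this issue's environment output.
print(is_matching_pair("2.3.0", "0.16.2+cu118"))  # False
```

The pair reported here does not match, which is consistent with the undefined-symbol warning from torchvision/image.so at startup.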
I ran into a similar issue recently, and it turned out that vLLM could not allocate blocks for the model. Here, I think you set image_feature_size to a value that is too high (normally it should be around 2k or so, not 60k).
Anyway, the --image-feature-size argument has since been removed (it is now computed automatically by #6089), so you should not run into this issue anymore.
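As a rough sanity check, the configured feature size can be compared with the context length reported in the startup log (max_seq_len=32768). The sketch below assumes each image contributes image_feature_size tokens to the prompt; the helper is hypothetical and is not vLLM's own validation logic.

```python
def image_tokens_fit(image_feature_size: int, max_seq_len: int) -> bool:
    """Return True if one image's feature tokens fit within the context window."""
    return image_feature_size <= max_seq_len


# Values taken from the startup log in this issue.
print(image_tokens_fit(65856, 32768))  # False
# A value near the ~2k mentioned above fits comfortably.
print(image_tokens_fit(2048, 32768))  # True
```

Under that assumption, a single image with 65856 feature tokens is already larger than the whole 32768-token context, before any text tokens are counted.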
Your current environment
🐛 Describe the bug
Running LLaVA-NeXT fails with an error:
python -m vllm.entrypoints.openai.api_server --model /ai/LLaVA-NeXT --image-token-id 32000 --image-input-shape 1,3,336,336 --image-input-type pixel_values --image-feature-size 65856 --chat-template template_llava.jinja --host 19*** --port 10860 --trust-remote-code --tensor-parallel-size 2 --dtype=half --disable-custom-all-reduce