microsoft / MInference

To speed up long-context LLM inference, MInference computes attention with approximate and dynamic sparsity, which reduces pre-filling inference latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License
365 stars, 11 forks

[Question]: python run_vllm.py TypeError: 'type' object is not subscriptable #13

Closed. junior-zsy closed this issue 3 days ago.

junior-zsy commented 5 days ago

Describe the bug

python run_vllm.py

2024-07-05 15:25:04,647 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2024-07-05 15:25:05,859 INFO worker.py:1771 -- Started a local Ray instance.
INFO 07-05 15:25:11 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/xxx/model/Qwen2-7B-Instruct', tokenizer='/xxx/model/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:42823 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
INFO 07-05 15:25:22 model_runner.py:104] Loading model weights took 7.1441 GB
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:23 model_runner.py:104] Loading model weights took 7.1441 GB
INFO 07-05 15:25:38 ray_gpu_executor.py:240] # GPU blocks: 49742, # CPU blocks: 9362
Traceback (most recent call last):
  File "/xxx/code/MInference/examples/run_vllm.py", line 31, in <module>
    llm = minference_patch(llm)
  File "/xxx/code/MInference/minference/models_patch.py", line 39, in __call__
    return self.patch_model(model)
  File "/xxx/code/MInference/minference/models_patch.py", line 102, in patch_model
    model = minference_patch_vllm(model, self.config.config_path)
  File "/xxx/code/MInference/minference/patch.py", line 1072, in minference_patch_vllm
    attn_forward = minference_vllm_forward(config)
  File "/xxx/code/MInference/minference/modules/minference_forward.py", line 771, in minference_vllm_forward
    attn_metadata: AttentionMetadata[FlashAttentionMetadata],
TypeError: 'type' object is not subscriptable

Dependencies: vllm 0.4.0, flash-attn 2.5.9.post1, torch 2.1.2, triton 2.1.0
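For context, the annotation AttentionMetadata[FlashAttentionMetadata] in minference_forward.py can only be evaluated if the installed vllm's AttentionMetadata class is subscriptable (i.e. generic); with an unsupported vllm version it evidently is not, which is exactly when Python raises this TypeError. A minimal, vllm-independent sketch of the failure mode (the class names below are placeholders, not vllm's real classes):

```python
# Standalone illustration (hypothetical classes, not vllm's real ones):
# subscripting a class that neither subclasses Generic nor defines
# __class_getitem__ raises exactly this error when the annotation is evaluated.
from typing import Generic, TypeVar

T = TypeVar("T")

class PlainMetadata:                 # stands in for a non-generic AttentionMetadata
    pass

class GenericMetadata(Generic[T]):   # stands in for a generic (subscriptable) one
    pass

alias = GenericMetadata[int]         # fine: Generic classes support subscription
print(alias)

try:
    PlainMetadata[int]               # TypeError: 'type' object is not subscriptable
except TypeError as err:
    print(err)
```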

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

iofu728 commented 5 days ago

Hi @junior-zsy, thanks for your feedback.

This issue is caused by the vllm version. Currently, we support vllm==0.4.1.

Please update MInference to 0.1.4 with pip install minference==0.1.4; this release fixes some other bugs (#14) and also makes MInference potentially compatible with vllm==0.4.0.
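As a quick sanity check before rerunning the example, you can print the installed versions to confirm the vllm/MInference pairing. This is a minimal sketch that is not from the original thread; the distribution names (e.g. flash-attn) are assumptions based on the PyPI package names:

```python
# Print the installed versions of the packages discussed in this thread.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "minference", "flash-attn", "torch", "triton"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```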

junior-zsy commented 5 days ago

@iofu728 I used multi-card inference and got an error:

Traceback (most recent call last):
  File "/xxx/code/MInference/examples/run_vllm.py", line 54, in <module>
    llm = minference_patch(llm)
  File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/models_patch.py", line 39, in __call__
    return self.patch_model(model)
  File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/models_patch.py", line 102, in patch_model
    model = minference_patch_vllm(model, self.config.config_path)
  File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/patch.py", line 1091, in minference_patch_vllm
    llm.llm_engine.model_executor.driver_worker.model_runner.model.apply(update_module)
AttributeError: 'RayWorkerWrapper' object has no attribute 'model_runner'
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Additionally, I am using the Qwen2-7B-Instruct model. The official Qwen2-7B-Instruct model requires a vllm version greater than 0.4.3, otherwise long-text results may be problematic. Can I run inference with vllm 0.4.3?

cyLi-Tiger commented 5 days ago

I also have the same need for Qwen2-7B-Instruct running on vllm.

iofu728 commented 3 days ago

Hi @junior-zsy and @cyLi-Tiger, we fixed this issue in 0.1.4.post1.

Please update MInference to version 0.1.4.post1. If the issue persists, feel free to reopen this issue.
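For reference, the patching flow that examples/run_vllm.py exercises looks roughly like the sketch below. The callable-patch pattern (minference_patch(llm)) is visible in the tracebacks above; the exact constructor arguments and generation parameters are assumptions that may differ between MInference and vllm versions:

```python
# Rough sketch of patching a vLLM engine with MInference; argument names are
# assumptions, so check examples/run_vllm.py for the installed version.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "Qwen/Qwen2-7B-Instruct"  # or a local path, as in the logs above

llm = LLM(model_name, max_model_len=128000, enforce_eager=True)

# Patch the engine so pre-filling uses MInference's dynamic sparse attention.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(["Summarize the following text: ..."],
                       SamplingParams(temperature=0.0, max_tokens=64))
print(outputs[0].outputs[0].text)
```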

cyLi-Tiger commented 1 day ago

Hi @iofu728, thanks for the fix!

I tried python run_vllm.py again with vllm 0.4.1, torch 2.2.1, triton 2.2.0, flash-attn 2.5.9.post1, and minference 0.1.4.post1. The results for Llama-3-8B-Instruct-Gradient-1048k looked good, but I got an error when running with Qwen2-7B-Instruct. Have you tested vllm on Qwen2 models?

python run_vllm.py

INFO 07-09 02:59:37 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/xxx/weights/Qwen2/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/xxx/weights/Qwen2/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-09 02:59:37 utils.py:608] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
INFO 07-09 02:59:37 selector.py:28] Using FlashAttention backend.
INFO 07-09 02:59:45 model_runner.py:173] Loading model weights took 14.2487 GB
Traceback (most recent call last):
  File "/xxx/experiment/kv_compress/MInference/examples/run_vllm.py", line 21, in <module>
    llm = LLM(
  File "/xxx/experiment/vllm/vllm/entrypoints/llm.py", line 118, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 277, in from_engine_args
    engine = cls(
  File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 160, in __init__
    self._initialize_kv_caches()
  File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/xxx/experiment/vllm/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/worker/worker.py", line 138, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/worker/model_runner.py", line 927, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/worker/model_runner.py", line 848, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 315, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 252, in forward
    hidden_states, residual = layer(
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 205, in forward
    hidden_states = self.self_attn(
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 151, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxx/experiment/vllm/vllm/attention/layer.py", line 48, in forward
    return self.impl.forward(query, key, value, kv_cache, attn_metadata,
  File "/xxx/experiment/vllm/vllm/attention/backends/flash_attn.py", line 220, in forward
    out = flash_attn_varlen_func(
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1066, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 581, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 86, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

iofu728 commented 1 day ago

Hi @cyLi-Tiger, thanks for your feedback. I tested vllm==0.4.1 with flash_attn==2.5.8 and vllm==0.4.3 with flash_attn==0.4.2, and both work well with Qwen2. Could you try reinstalling minference and flash_attn, and then running vllm again?