vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Qwen2-57B-A14B inference error with two GPUs #5692

Open CXLiang123 opened 2 weeks ago

CXLiang123 commented 2 weeks ago

Your current environment

Environment: torch 2.3.0, vllm 0.5.0.post1, transformers 4.41.2
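(For reference, a minimal sketch of how these versions can be confirmed; it assumes the packages import cleanly in the serving environment.)

import torch
import transformers
import vllm

# Print the versions reported above plus basic CUDA visibility.
print("torch        :", torch.__version__)
print("transformers :", transformers.__version__)
print("vllm         :", vllm.__version__)
print("CUDA devices :", torch.cuda.device_count())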

Summary of the problem: a smaller MoE model ('/data/models/qwen/qwen1.5-2.7Bmoe') runs without issue, but the larger model fails with the error shown at the bottom.

Code:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.transformers_utils.tokenizer import get_tokenizer

engine_args = AsyncEngineArgs(
    model='/data/models/Qwen/Qwen2-57B-A14B',
    tokenizer_mode='auto',
    trust_remote_code=True,
    dtype='bfloat16',
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)

engine = AsyncLLMEngine.from_engine_args(engine_args)
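(For reference, a hedged diagnostic variant of the same engine setup. It only adds the two switches that appear in the engine config dumped in the logs below, enforce_eager and disable_custom_all_reduce; this is an isolation sketch, not a confirmed fix.)

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Diagnostic sketch: same model and tensor parallelism as above, but with CUDA
# graph capture disabled (enforce_eager=True) and the custom allreduce path
# disabled (disable_custom_all_reduce=True) to rule those components out.
engine_args = AsyncEngineArgs(
    model='/data/models/Qwen/Qwen2-57B-A14B',
    tokenizer_mode='auto',
    trust_remote_code=True,
    dtype='bfloat16',
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    disable_custom_all_reduce=True,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)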

The error output is shown at the bottom. Any help would be appreciated.

🐛 Describe the bug

2024-06-20 02:53:02,678 INFO worker.py:1724 -- Started a local Ray instance. INFO 06-20 02:53:03 config.py:623] Defaulting to use mp for distributed inference INFO 06-20 02:53:03 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/Qwen/Qwen2-57B-A14B', speculative_config=None, tokenizer='/data/models/Qwen/Qwen2-57B-A14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/models/Qwen/Qwen2-57B-A14B) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. (VllmWorkerProcess pid=38210) INFO 06-20 02:53:04 multiproc_worker_utils.py:215] Worker ready; awaiting tasks INFO 06-20 02:53:04 utils.py:637] Found nccl from library libnccl.so.2 INFO 06-20 02:53:04 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=38210) INFO 06-20 02:53:04 utils.py:637] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=38210) INFO 06-20 02:53:04 pynccl.py:63] vLLM is using nccl==2.20.5 INFO 06-20 02:53:05 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json (VllmWorkerProcess pid=38210) INFO 06-20 02:53:05 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json WARNING 06-20 02:53:05 custom_all_reduce.py:175] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorkerProcess pid=38210) WARNING 06-20 02:53:05 custom_all_reduce.py:175] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorkerProcess pid=38210) INFO 06-20 02:53:49 model_runner.py:160] Loading model weights took 53.5051 GB INFO 06-20 02:53:49 model_runner.py:160] Loading model weights took 53.5051 GB (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA error: an illegal memory access was encountered (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. 
(VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] , Traceback (most recent call last): (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] output = executor(*args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return func(*args, *kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] self.model_runner.profile_run() (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return func(args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] self.execute_model(seqs, kv_caches) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return func(*args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] hidden_states = model_executable( (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return forward_call(args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 401, in forward (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches, (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File 
"/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return self._call_impl(*args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return forward_call(*args, *kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 369, in forward (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] hidden_states, residual = layer(positions, hidden_states, (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return self._call_impl(args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return forward_call(*args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 329, in forward (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] hidden_states = self.mlp(hidden_states) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return forward_call(args, kwargs) (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 165, in forward (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] final_hidden_states = fused_moe(hidden_states, (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515, in fused_moe (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] return fused_experts(hidden_states, (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 462, in fused_experts (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 
multiproc_worker_utils.py:226] return torch.sum(intermediate_cache3.view(intermediate_cache3.shape), (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] RuntimeError: CUDA error: an illegal memory access was encountered (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] (VllmWorkerProcess pid=38210) ERROR 06-20 02:53:52 multiproc_worker_utils.py:226] rank0: Traceback (most recent call last): rank0: File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/vllm_server.py", line 13, in rank0: from api.models import EMBEDDED_MODEL, VLLM_ENGINE rank0: File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/models.py", line 91, in rank0: VLLM_ENGINE = get_vllm_engine() if config.USE_VLLM else None # model for vllm generate rank0: File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/models.py", line 67, in get_vllm_engine rank0: engine = AsyncLLMEngine.from_engine_args(engine_args) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args rank0: engine = cls( rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in init rank0: self.engine = self._init_engine(args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine rank0: return engine_class(args, **kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in init

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks rank0: num_blocks = self._run_workers("determine_num_available_blocks", ) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers rank0: driver_worker_output = driver_worker_method(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(args, **kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(*args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run rank0: self.execute_model(seqs, kv_caches) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model rank0: hidden_states = model_executable( rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 401, in forward rank0: hidden_states = self.model(input_ids, positions, kv_caches, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 369, in forward rank0: hidden_states, residual = layer(positions, hidden_states, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 329, in forward rank0: hidden_states = self.mlp(hidden_states) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 165, in forward rank0: final_hidden_states = fused_moe(hidden_states, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515, in fused_moe rank0: return fused_experts(hidden_states, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 462, in fused_experts rank0: return torch.sum(intermediate_cache3.view(intermediate_cache3.shape), rank0: RuntimeError: CUDA error: an illegal memory access was encountered rank0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 
rank0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1. rank0: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

SIGTERM received at time=1718823233 on cpu 21 PC: @ 0x7f8f4cd6f75d (unknown) read @ 0x7f8f4cd70630 (unknown) (unknown) [2024-06-20 02:53:53,879 E 38210 36104] logging.cc:361: SIGTERM received at time=1718823233 on cpu 21 [2024-06-20 02:53:53,879 E 38210 36104] logging.cc:361: PC: @ 0x7f8f4cd6f75d (unknown) read [2024-06-20 02:53:53,879 E 38210 36104] logging.cc:361: @ 0x7f8f4cd70630 (unknown) (unknown)

youkaichao commented 2 weeks ago

Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to locate the error first?
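(For reference, a minimal sketch of the debug switches that end up visible in the logs below; it assumes they are set before vLLM and CUDA are initialized, e.g. at the top of the launch script.)

import os

# VLLM_TRACE_FUNCTION records every Python call (very slow; debugging only),
# CUDA_LAUNCH_BLOCKING makes the failing kernel surface at the right stack frame,
# NCCL_DEBUG enables the detailed NCCL logs seen below.
os.environ["VLLM_TRACE_FUNCTION"] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["NCCL_DEBUG"] = "TRACE"

# ...then build AsyncEngineArgs / AsyncLLMEngine as in the report above.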

CXLiang123 commented 2 weeks ago

https://docs.vllm.ai/en/latest/getting_started/debugging.html

Thanks for your reply. I added the debug settings following that document; the log is below, but I'm still a bit confused, so please take a look. By the way, my machine has four A100 cards and I used two of them for this run.

(llm_server3) [root@test2 alg_backend]# CUDA_VISIBLE_DEVICES=2,3 python llm_server.py default_model: jpt-eshop_qwenmoe Config: {'HOST': '0.0.0.0', 'PORT': 8009, 'MODEL_NAME': 'jpt-eshop_qwenmoe', 'MODEL_PATH': '/data/models/Qwen/Qwen2-57B-A14B', 'ADAPTER_MODEL_PATH': None, 'DEVICE': 'cuda', 'DEVICE_MAP': 'auto', 'GPUS': '', 'NUM_GPUs': 2, 'QUANTIZE': 16, 'EMBEDDING_NAME': None, 'CONTEXT_LEN': None, 'LOAD_IN_8BIT': False, 'LOAD_IN_4BIT': False, 'USING_PTUNING_V2': False, 'STREAM_INTERVERL': 2, 'PROMPT_NAME': 'qwen', 'PATCH_TYPE': None, 'TRAINING_LENGTH': 4096, 'WINDOW_SIZE': 512, 'API_PREFIX': '/v1', 'USE_VLLM': True, 'TRUST_REMOTE_CODE': True, 'TOKENIZE_MODE': 'auto', 'TENSOR_PARALLEL_SIZE': 2, 'DTYPE': 'bfloat16', 'EMBEDDING_SIZE': None} 2024-06-20 13:52:30,290 INFO worker.py:1724 -- Started a local Ray instance. INFO 06-20 13:52:30 config.py:623] Defaulting to use mp for distributed inference INFO 06-20 13:52:30 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/Qwen/Qwen2-57B-A14B', speculative_config=None, tokenizer='/data/models/Qwen/Qwen2-57B-A14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/models/Qwen/Qwen2-57B-A14B) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. WARNING 06-20 13:52:31 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only. INFO 06-20 13:52:31 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-983a6eb38ad64ec08e01b641d0d0ab24/VLLM_TRACE_FUNCTION_for_process_2136_thread_139961350321984_at_2024-06-20_13:52:31.136067.log (VllmWorkerProcess pid=4234) WARNING 06-20 13:52:31 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=4234) INFO 06-20 13:52:31 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-983a6eb38ad64ec08e01b641d0d0ab24/VLLM_TRACE_FUNCTION_for_process_4234_thread_139961350321984_at_2024-06-20_13:52:31.136796.log (VllmWorkerProcess pid=4234) INFO 06-20 13:52:31 multiproc_worker_utils.py:215] Worker ready; awaiting tasks DEBUG 06-20 13:52:31 parallel_state.py:526] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.169:37721 backend=nccl (VllmWorkerProcess pid=4234) DEBUG 06-20 13:52:31 parallel_state.py:526] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://192.168.1.169:37721 backend=nccl INFO 06-20 13:52:31 utils.py:637] Found nccl from library libnccl.so.2 INFO 06-20 13:52:31 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=4234) INFO 06-20 13:52:31 utils.py:637] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=4234) INFO 06-20 13:52:31 pynccl.py:63] vLLM is using nccl==2.20.5 test2:2136:2136 [0] NCCL INFO Bootstrap : Using eth0:192.168.1.169<0> test2:2136:2136 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation test2:2136:2136 [0] NCCL INFO cudaDriverVersion 12030 NCCL version 2.20.5+cuda12.4 test2:4234:4234 [1] NCCL INFO cudaDriverVersion 12030 test2:4234:4234 [1] NCCL INFO Bootstrap : Using eth0:192.168.1.169<0> test2:4234:4234 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation test2:2136:2136 [0] NCCL INFO Failed to open libibverbs.so[.1] test2:2136:2136 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.169<0> [1]br-2db799344e54:172.19.0.1<0> [2]br-6826f131a211:172.18.0.1<0> test2:2136:2136 [0] NCCL INFO Using non-device net plugin version 0 test2:2136:2136 [0] NCCL INFO Using network Socket test2:4234:4234 [1] NCCL INFO Failed to open libibverbs.so[.1] test2:4234:4234 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.169<0> [1]br-2db799344e54:172.19.0.1<0> [2]br-6826f131a211:172.18.0.1<0> test2:4234:4234 [1] NCCL INFO Using non-device net plugin version 0 test2:4234:4234 [1] NCCL INFO Using network Socket test2:4234:4234 [1] NCCL INFO comm 0xb585490 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0x6bdfae83815c85eb - Init START test2:2136:2136 [0] NCCL INFO comm 0xb588d60 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0x6bdfae83815c85eb - Init START test2:2136:2136 [0] NCCL INFO comm 0xb588d60 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 test2:4234:4234 [1] NCCL INFO comm 0xb585490 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 test2:2136:2136 [0] NCCL INFO Channel 00/02 : 0 1 test2:4234:4234 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 test2:2136:2136 [0] NCCL INFO Channel 01/02 : 0 1 test2:4234:4234 [1] NCCL INFO P2P Chunksize set to 131072 test2:2136:2136 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 test2:2136:2136 [0] NCCL INFO P2P Chunksize set to 131072 test2:4234:4234 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. test2:2136:2136 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
test2:4234:4234 [1] NCCL INFO Channel 00 : 1[3] -> 0[2] via SHM/direct/direct test2:2136:2136 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct test2:4234:4234 [1] NCCL INFO Channel 01 : 1[3] -> 0[2] via SHM/direct/direct test2:2136:2136 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct test2:4234:4234 [1] NCCL INFO Connected all rings test2:4234:4234 [1] NCCL INFO Connected all trees test2:2136:2136 [0] NCCL INFO Connected all rings test2:4234:4234 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 test2:2136:2136 [0] NCCL INFO Connected all trees test2:4234:4234 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer test2:2136:2136 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 test2:2136:2136 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer test2:2136:2136 [0] NCCL INFO comm 0xb588d60 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0x6bdfae83815c85eb - Init COMPLETE test2:4234:4234 [1] NCCL INFO comm 0xb585490 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0x6bdfae83815c85eb - Init COMPLETE INFO 06-20 13:52:32 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_2,3.json (VllmWorkerProcess pid=4234) INFO 06-20 13:52:32 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_2,3.json WARNING 06-20 13:52:32 custom_all_reduce.py:175] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorkerProcess pid=4234) WARNING 06-20 13:52:32 custom_all_reduce.py:175] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. 
(VllmWorkerProcess pid=4234) INFO 06-20 13:53:38 model_runner.py:160] Loading model weights took 53.5051 GB INFO 06-20 13:53:38 model_runner.py:160] Loading model weights took 53.5051 GB test2:2136:5197 [0] NCCL INFO Using non-device net plugin version 0 test2:2136:5197 [0] NCCL INFO Using network Socket test2:4234:5198 [1] NCCL INFO Using non-device net plugin version 0 test2:4234:5198 [1] NCCL INFO Using network Socket test2:4234:5198 [1] NCCL INFO comm 0x176655e0 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0x2bc6282111042126 - Init START test2:2136:5197 [0] NCCL INFO comm 0x1832e060 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0x2bc6282111042126 - Init START test2:4234:5198 [1] NCCL INFO comm 0x176655e0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 test2:2136:5197 [0] NCCL INFO comm 0x1832e060 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 test2:4234:5198 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 test2:2136:5197 [0] NCCL INFO Channel 00/02 : 0 1 test2:4234:5198 [1] NCCL INFO P2P Chunksize set to 131072 test2:2136:5197 [0] NCCL INFO Channel 01/02 : 0 1 test2:2136:5197 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 test2:2136:5197 [0] NCCL INFO P2P Chunksize set to 131072 test2:2136:5197 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct test2:4234:5198 [1] NCCL INFO Channel 00 : 1[3] -> 0[2] via SHM/direct/direct test2:4234:5198 [1] NCCL INFO Channel 01 : 1[3] -> 0[2] via SHM/direct/direct test2:2136:5197 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct test2:2136:5197 [0] NCCL INFO Connected all rings test2:2136:5197 [0] NCCL INFO Connected all trees test2:4234:5198 [1] NCCL INFO Connected all rings test2:4234:5198 [1] NCCL INFO Connected all trees test2:4234:5198 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 test2:4234:5198 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer test2:2136:5197 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 test2:2136:5197 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer test2:2136:5197 [0] NCCL INFO comm 0x1832e060 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0x2bc6282111042126 - Init COMPLETE test2:4234:5198 [1] NCCL INFO comm 0x176655e0 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0x2bc6282111042126 - Init COMPLETE (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Triton Error [CUDA]: an illegal memory access was encountered, Traceback (most recent call last): (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] output = executor(*args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return func(*args, *kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in 
determine_num_available_blocks (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] self.model_runner.profile_run() (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return func(args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] self.execute_model(seqs, kv_caches) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return func(*args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] hidden_states = model_executable( (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return forward_call(args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 401, in forward (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches, (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return self._call_impl(*args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return forward_call(*args, *kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 369, in forward (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] hidden_states, residual = layer(positions, hidden_states, (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File 
"/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return self._call_impl(args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return forward_call(*args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 329, in forward (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] hidden_states = self.mlp(hidden_states) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return self._call_impl(*args, *kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return forward_call(args, kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 165, in forward (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] final_hidden_states = fused_moe(hidden_states, (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515, in fused_moe (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return fused_experts(hidden_states, (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 445, in fused_experts (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] invoke_fused_moe_kernel(intermediate_cache2, (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245, in invoke_fused_moe_kernel (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] fused_moe_kernel[grid]( (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] return lambda *args, kwargs: self.run(grid=grid, warmup=False, *args, *kwargs) (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] 
kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas, # number of warps/ctas per instance (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered (VllmWorkerProcess pid=4234) ERROR 06-20 13:53:42 multiproc_worker_utils.py:226] rank0: Traceback (most recent call last): rank0: File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/vllm_server.py", line 22, in rank0: from api.models import EMBEDDED_MODEL, VLLM_ENGINE rank0: File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/models.py", line 92, in rank0: VLLM_ENGINE = get_vllm_engine() if config.USE_VLLM else None # model for vllm generate rank0: File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/models.py", line 68, in get_vllm_engine rank0: engine = AsyncLLMEngine.from_engine_args(engine_args) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args rank0: engine = cls( rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in init rank0: self.engine = self._init_engine(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine rank0: return engine_class(*args, **kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in init

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks rank0: num_blocks = self._run_workers("determine_num_available_blocks", ) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers rank0: driver_worker_output = driver_worker_method(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(args, **kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(*args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run rank0: self.execute_model(seqs, kv_caches) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context rank0: return func(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model rank0: hidden_states = model_executable( rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 401, in forward rank0: hidden_states = self.model(input_ids, positions, kv_caches, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 369, in forward rank0: hidden_states, residual = layer(positions, hidden_states, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(*args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 329, in forward rank0: hidden_states = self.mlp(hidden_states) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl rank0: return self._call_impl(args, kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl rank0: return forward_call(*args, **kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 165, in forward rank0: final_hidden_states = fused_moe(hidden_states, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515, in fused_moe rank0: return fused_experts(hidden_states, rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 445, in fused_experts

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245, in invoke_fused_moe_kernel

rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in rank0: return lambda *args, kwargs: self.run(grid=grid, warmup=False, *args, *kwargs) rank0: File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run rank0: kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas, # number of warps/ctas per instance rank0: RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered SIGTERM received at time=1718862824 on cpu 8 PC: @ 0x7f4b4a4fc75d (unknown) __read @ 0x7f4b4a4fd630 (unknown) (unknown) [2024-06-20 13:53:44,200 E 4234 2136] logging.cc:361: SIGTERM received at time=1718862824 on cpu 8 *** [2024-06-20 13:53:44,200 E 4234 2136] logging.cc:361: PC: @ 0x7f4b4a4fc75d (unknown) __read [2024-06-20 13:53:44,200 E 4234 2136] logging.cc:361: @ 0x7f4b4a4fd630 (unknown) (unknown) INFO 06-20 13:53:45 multiproc_worker_utils.py:123] Killing local vLLM worker processes

CXLiang123 commented 2 weeks ago

https://docs.vllm.ai/en/latest/getting_started/debugging.html

Thanks for your reply. I added the debug settings following that document; the log is below, but I'm still a bit confused, so please take a look. By the way, my machine has four A100 cards and I used two of them for this run.

(llm_server3) [root@test2 alg_backend]# CUDA_VISIBLE_DEVICES=2,3 python llm_server.py default_model: jpt-eshop_qwenmoe Config: {'HOST': '0.0.0.0', 'PORT': 8009, 'MODEL_NAME': 'jpt-eshop_qwenmoe', 'MODEL_PATH': '/data/models/Qwen/Qwen2-57B-A14B', 'ADAPTER_MODEL_PATH': None, 'DEVICE': 'cuda', 'DEVICE_MAP': 'auto', 'GPUS': '', 'NUM_GPUs': 2, 'QUANTIZE': 16, 'EMBEDDING_NAME': None, 'CONTEXT_LEN': None, 'LOAD_IN_8BIT': False, 'LOAD_IN_4BIT': False, 'USING_PTUNING_V2': False, 'STREAM_INTERVERL': 2, 'PROMPT_NAME': 'qwen', 'PATCH_TYPE': None, 'TRAINING_LENGTH': 4096, 'WINDOW_SIZE': 512, 'API_PREFIX': '/v1', 'USE_VLLM': True, 'TRUST_REMOTE_CODE': True, 'TOKENIZE_MODE': 'auto', 'TENSOR_PARALLEL_SIZE': 2, 'DTYPE': 'bfloat16', 'EMBEDDING_SIZE': None} 2024-06-20 13:40:37,336 INFO worker.py:1724 -- Started a local Ray instance. INFO 06-20 13:40:37 config.py:623] Defaulting to use mp for distributed inference INFO 06-20 13:40:37 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/Qwen/Qwen2-57B-A14B', speculative_config=None, tokenizer='/data/models/Qwen/Qwen2-57B-A14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/models/Qwen/Qwen2-57B-A14B) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. WARNING 06-20 13:40:38 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only. INFO 06-20 13:40:38 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-78a998bd23db4527ab6604d58ea1cea4/VLLM_TRACE_FUNCTION_for_process_56697_thread_140176676636480_at_2024-06-20_13:40:38.200233.log (VllmWorkerProcess pid=58772) WARNING 06-20 13:40:38 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
(VllmWorkerProcess pid=58772) INFO 06-20 13:40:38 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-78a998bd23db4527ab6604d58ea1cea4/VLLM_TRACE_FUNCTION_for_process_58772_thread_140176676636480_at_2024-06-20_13:40:38.200989.log (VllmWorkerProcess pid=58772) INFO 06-20 13:40:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks DEBUG 06-20 13:40:38 parallel_state.py:526] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://192.168.1.169:42621 backend=nccl (VllmWorkerProcess pid=58772) DEBUG 06-20 13:40:38 parallel_state.py:526] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://192.168.1.169:42621 backend=nccl INFO 06-20 13:40:39 utils.py:637] Found nccl from library libnccl.so.2 INFO 06-20 13:40:39 pynccl.py:63] vLLM is using nccl==2.20.5 (VllmWorkerProcess pid=58772) INFO 06-20 13:40:39 utils.py:637] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=58772) INFO 06-20 13:40:39 pynccl.py:63] vLLM is using nccl==2.20.5 test2:56697:56697 [0] NCCL INFO Bootstrap : Using eth0:192.168.1.169<0> test2:56697:56697 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation test2:56697:56697 [0] NCCL INFO cudaDriverVersion 12030 NCCL version 2.20.5+cuda12.4 test2:58772:58772 [1] NCCL INFO cudaDriverVersion 12030 test2:58772:58772 [1] NCCL INFO Bootstrap : Using eth0:192.168.1.169<0> test2:58772:58772 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation test2:56697:56697 [0] NCCL INFO Failed to open libibverbs.so[.1] test2:56697:56697 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.169<0> [1]br-2db799344e54:172.19.0.1<0> [2]br-6826f131a211:172.18.0.1<0> test2:56697:56697 [0] NCCL INFO Using non-device net plugin version 0 test2:56697:56697 [0] NCCL INFO Using network Socket test2:58772:58772 [1] NCCL INFO Failed to open libibverbs.so[.1] test2:58772:58772 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.169<0> [1]br-2db799344e54:172.19.0.1<0> [2]br-6826f131a211:172.18.0.1<0> test2:58772:58772 [1] NCCL INFO Using non-device net plugin version 0 test2:58772:58772 [1] NCCL INFO Using network Socket test2:58772:58772 [1] NCCL INFO comm 0xc261020 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0x2863420e521e68fe - Init START test2:56697:56697 [0] NCCL INFO comm 0xc264720 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0x2863420e521e68fe - Init START test2:58772:58772 [1] NCCL INFO comm 0xc261020 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0 test2:56697:56697 [0] NCCL INFO comm 0xc264720 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 test2:56697:56697 [0] NCCL INFO Channel 00/02 : 0 1 test2:58772:58772 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 test2:56697:56697 [0] NCCL INFO Channel 01/02 : 0 1 test2:58772:58772 [1] NCCL INFO P2P Chunksize set to 131072 test2:56697:56697 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 test2:56697:56697 [0] NCCL INFO P2P Chunksize set to 131072 test2:58772:58772 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. test2:56697:56697 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0. 
test2:56697:56697 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct test2:58772:58772 [1] NCCL INFO Channel 00 : 1[3] -> 0[2] via SHM/direct/direct test2:56697:56697 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct test2:58772:58772 [1] NCCL INFO Channel 01 : 1[3] -> 0[2] via SHM/direct/direct test2:56697:56697 [0] NCCL INFO Connected all rings test2:56697:56697 [0] NCCL INFO Connected all trees test2:58772:58772 [1] NCCL INFO Connected all rings test2:58772:58772 [1] NCCL INFO Connected all trees test2:58772:58772 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 test2:58772:58772 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer test2:56697:56697 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 test2:56697:56697 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer test2:56697:56697 [0] NCCL INFO comm 0xc264720 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0x2863420e521e68fe - Init COMPLETE test2:58772:58772 [1] NCCL INFO comm 0xc261020 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0x2863420e521e68fe - Init COMPLETE INFO 06-20 13:40:39 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_2,3.json (VllmWorkerProcess pid=58772) INFO 06-20 13:40:39 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_2,3.json WARNING 06-20 13:40:39 custom_all_reduce.py:175] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorkerProcess pid=58772) WARNING 06-20 13:40:39 custom_all_reduce.py:175] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. 
(VllmWorkerProcess pid=58772) INFO 06-20 13:41:44 model_runner.py:160] Loading model weights took 53.5051 GB
INFO 06-20 13:41:45 model_runner.py:160] Loading model weights took 53.5051 GB
test2:56697:59440 [0] NCCL INFO Using non-device net plugin version 0
test2:56697:59440 [0] NCCL INFO Using network Socket
test2:58772:59441 [1] NCCL INFO Using non-device net plugin version 0
test2:58772:59441 [1] NCCL INFO Using network Socket
test2:58772:59441 [1] NCCL INFO comm 0x183404d0 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0xe478ef85b5692b1b - Init START
test2:56697:59440 [0] NCCL INFO comm 0x187b96f0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0xe478ef85b5692b1b - Init START
test2:58772:59441 [1] NCCL INFO comm 0x183404d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
test2:56697:59440 [0] NCCL INFO comm 0x187b96f0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
test2:58772:59441 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
test2:56697:59440 [0] NCCL INFO Channel 00/02 : 0 1
test2:58772:59441 [1] NCCL INFO P2P Chunksize set to 131072
test2:56697:59440 [0] NCCL INFO Channel 01/02 : 0 1
test2:56697:59440 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
test2:56697:59440 [0] NCCL INFO P2P Chunksize set to 131072
test2:58772:59441 [1] NCCL INFO Channel 00 : 1[3] -> 0[2] via SHM/direct/direct
test2:56697:59440 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct
test2:58772:59441 [1] NCCL INFO Channel 01 : 1[3] -> 0[2] via SHM/direct/direct
test2:56697:59440 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct
test2:56697:59440 [0] NCCL INFO Connected all rings
test2:56697:59440 [0] NCCL INFO Connected all trees
test2:58772:59441 [1] NCCL INFO Connected all rings
test2:58772:59441 [1] NCCL INFO Connected all trees
test2:58772:59441 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
test2:58772:59441 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
test2:56697:59440 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
test2:56697:59440 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
test2:58772:59441 [1] NCCL INFO comm 0x183404d0 rank 1 nranks 2 cudaDev 1 nvmlDev 3 busId b0 commId 0xe478ef85b5692b1b - Init COMPLETE
test2:56697:59440 [0] NCCL INFO comm 0x187b96f0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId a0 commId 0xe478ef85b5692b1b - Init COMPLETE
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: Triton Error [CUDA]: an illegal memory access was encountered, Traceback (most recent call last):
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     self.execute_model(seqs, kv_caches)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     hidden_states = model_executable(
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 401, in forward
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 369, in forward
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 329, in forward
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 165, in forward
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     final_hidden_states = fused_moe(hidden_states,
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515, in fused_moe
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return fused_experts(hidden_states,
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 445, in fused_experts
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     invoke_fused_moe_kernel(intermediate_cache2,
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245, in invoke_fused_moe_kernel
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     fused_moe_kernel[grid](
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]     kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas,  # number of warps/ctas per instance
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226] RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
(VllmWorkerProcess pid=58772) ERROR 06-20 13:41:48 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/vllm_server.py", line 22, in <module>
[rank0]:     from api.models import EMBEDDED_MODEL, VLLM_ENGINE
[rank0]:   File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/models.py", line 91, in <module>
[rank0]:     VLLM_ENGINE = get_vllm_engine() if config.USE_VLLM else None  # model for vllm generate
[rank0]:   File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/api-for-open-llm/api/models.py", line 67, in get_vllm_engine
[rank0]:     engine = AsyncLLMEngine.from_engine_args(engine_args)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
[rank0]:     hidden_states = model_executable(
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 401, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 369, in forward
[rank0]:     hidden_states, residual = layer(positions, hidden_states,
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 329, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 165, in forward
[rank0]:     final_hidden_states = fused_moe(hidden_states,
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515, in fused_moe
[rank0]:     return fused_experts(hidden_states,
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 445, in fused_experts
[rank0]:     invoke_fused_moe_kernel(intermediate_cache2,
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245, in invoke_fused_moe_kernel
[rank0]:     fused_moe_kernel[grid](
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
[rank0]:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run
[rank0]:     kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas,  # number of warps/ctas per instance
[rank0]: RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
SIGTERM received at time=1718862110 on cpu 8
PC: @ 0x7f7d6cc2975d (unknown) read
    @ 0x7f7d6cc2a630 (unknown) (unknown)
[2024-06-20 13:41:50,332 E 58772 56697] logging.cc:361: SIGTERM received at time=1718862110 on cpu 8
[2024-06-20 13:41:50,332 E 58772 56697] logging.cc:361: PC: @ 0x7f7d6cc2975d (unknown) read
[2024-06-20 13:41:50,332 E 58772 56697] logging.cc:361:     @ 0x7f7d6cc2a630 (unknown) (unknown)
INFO 06-20 13:41:50 multiproc_worker_utils.py:123] Killing local vLLM worker processes
^CTraceback (most recent call last):
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connectionpool.py", line 496, in _make_request
    conn.request(
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connection.py", line 395, in request
    self.endheaders()
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connection.py", line 243, in connect
    self.sock = self._new_conn()
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connection.py", line 218, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0010c2e890>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=8009): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0010c2e890>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/LLM_Manager.py", line 201, in _wait_for_llm_server
    response = requests.post(url=url, json=data)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/root/miniconda3/envs/llm_server3/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8009): Max retries exceeded with url: /v1/chat/completions (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0010c2e890>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/cxl/com/Doc_QA/alg_backend/llm_server.py", line 55, in <module>
    llm_manager.start_llm(llm_manager.default_model)
  File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/LLM_Manager.py", line 230, in start_llm
    res = self._wait_for_llm_server()
  File "/data/cxl/com/Doc_QA/alg_backend/llm_manager/LLM_Manager.py", line 207, in _wait_for_llm_server
    time.sleep(0.5)
KeyboardInterrupt

The simplest way to reproduce the problem is the following:

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.transformers_utils.tokenizer import get_tokenizer

engine_args = AsyncEngineArgs(
    # model='/data/models/qwen/qwen1.5-2.7Bmoe',  # ok
    model='/data/models/Qwen/Qwen2-57B-A14B',  # bug
    tokenizer_mode='auto',
    trust_remote_code=True,
    dtype='bfloat16',
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90
)

engine = AsyncLLMEngine.from_engine_args(engine_args)
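
If it helps triage, here is a minimal sketch (an assumption on my side, not something I have verified end-to-end) of rerunning the same repro with synchronous kernel launches; CUDA_LAUNCH_BLOCKING is a standard CUDA/PyTorch debugging switch, so the illegal memory access should be reported at the faulting launch (e.g. the fused_moe Triton kernel) rather than at a later, unrelated API call:

import os

# Hypothetical debugging run: make CUDA kernel launches synchronous before any
# engine/worker is created, then build the engine exactly as in the snippet above.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... construct AsyncEngineArgs and call AsyncLLMEngine.from_engine_args(engine_args)
#     with the same arguments as in the repro above.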

youkaichao commented 2 weeks ago

cc @WoosukKwon @pcmoritz does the fused moe kernel also suffer from illegal memory access?

CXLiang123 commented 2 weeks ago

I tried Qwen's 1.5 MoE, the one with the smaller parameter count, and it works fine; switching to Qwen2's 57B triggers the problem.

randxie commented 3 days ago

I tried to run Qwen2-57B-A14B with vLLM and saw a similar error message. I had to build the latest vLLM and Triton from source to work around it.
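
A quick way to confirm which builds are actually picked up after reinstalling is a sketch like the one below (the __version__ attributes are the standard ones exposed by both packages; the exact version strings you see will depend on your local build):

import triton
import vllm

# Print the versions that the current environment actually imports,
# to confirm the source builds are the ones in use.
print("vllm:", vllm.__version__)
print("triton:", triton.__version__)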