Closed JJJJerry closed 2 months ago
I built vLLM from source (pre-release vtest) and exported VLLM_INSTALL_PUNICA_KERNELS=1. I can run it with 2 GPUs (tensor_parallel_size=2), but it fails when I run it with 4 GPUs.
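For context, vllm_qwen2_lora.py is essentially a script of this shape (a sketch only; the model path, adapter path, and prompt here are placeholders, not my exact script):

import vllm
from vllm.lora.request import LoRARequest

llm = vllm.LLM(
    "/path/to/Qwen2-72B-Instruct",  # placeholder model path
    enable_lora=True,
    trust_remote_code=True,
    tensor_parallel_size=4,  # works when set to 2, fails with 4
)
outputs = llm.generate(
    ["Hello, introduce yourself."],
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)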
I have the same issue.
Hi, #5036 should be able to address your issue. You can clone the corresponding branch to test it.
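Something like the following should fetch and build that branch (a sketch; add whatever build flags you normally use):

git clone https://github.com/vllm-project/vllm.git
cd vllm
# fetch the PR branch by its number and check it out
git fetch origin pull/5036/head:refactor-punica-kernel
git checkout refactor-punica-kernel
pip install -e .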
Thanks, the branch "refactor-punica-kernel" works well.
But there is a bug: when I run the script above a second time, it raises an error:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-04 20:27:56 config.py:703] Defaulting to use mp for distributed inference
INFO 07-04 20:27:56 llm_engine.py:169] Initializing an LLM engine (v0.5.0.post1) with config: model='/data03/irlab_share/Qwen/Qwen2-72B-Instruct', speculative_config=None, tokenizer='/data03/irlab_share/Qwen/Qwen2-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data03/irlab_share/Qwen/Qwen2-72B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=105102) INFO 07-04 20:27:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=105103) INFO 07-04 20:27:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=105101) INFO 07-04 20:27:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=105103) INFO 07-04 20:28:00 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=105103) INFO 07-04 20:28:00 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-04 20:28:00 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=105101) INFO 07-04 20:28:00 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=105102) INFO 07-04 20:28:00 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=105101) INFO 07-04 20:28:00 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-04 20:28:00 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=105102) INFO 07-04 20:28:00 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-04 20:28:02 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=105101) WARNING 07-04 20:28:02 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=105102) WARNING 07-04 20:28:02 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=105103) WARNING 07-04 20:28:02 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=105103) INFO 07-04 20:28:18 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=105102) INFO 07-04 20:28:18 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=105101) INFO 07-04 20:28:19 model_runner.py:254] Loading model weights took 33.9833 GB
INFO 07-04 20:28:19 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: [Errno 2] No such file or directory: '/data/wangchunhao-slurm/.triton/cache/098aa9b899bc0244743654c666c2e82a/_sgmv_shrink_kernel.json.tmp.pid_104939_42450', Traceback (most recent call last):
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/worker.py", line 175, in determine_num_available_blocks
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/model_runner.py", line 849, in profile_run
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/model_runner.py", line 1215, in execute_model
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 336, in forward
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 257, in forward
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] hidden_states, residual = layer(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 209, in forward
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] hidden_states = self.self_attn(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 153, in forward
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] qkv, _ = self.qkv_proj(hidden_states)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 511, in forward
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] output_parallel = self.apply(input_, bias)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 918, in apply
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] _apply_lora_packed_nslice(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 134, in _apply_lora_packed_nslice
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] add_lora(output,
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/punica.py", line 231, in add_lora
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] add_shrink(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/punica.py", line 84, in add_shrink
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] sgmv_shrink(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/ops/sgmv_shrink.py", line 162, in sgmv_shrink
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] _sgmv_shrink_kernel[grid](
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] self.cache[device][key] = compile(
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 202, in compile
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return CompiledKernel(so_path, metadata_group.get(metadata_filename))
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 230, in __init__
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] self.asm = {
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 231, in <dictcomp>
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.binary_ext else file.read_text()
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/pathlib.py", line 1134, in read_text
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] with self.open(mode='r', encoding=encoding, errors=errors) as f:
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] File "/data/wangchunhao-slurm/workspace/anaconda/envs/llama_factory/lib/python3.10/pathlib.py", line 1119, in open
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] return self._accessor.open(self, mode, buffering, encoding, errors,
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226] FileNotFoundError: [Errno 2] No such file or directory: '/data/wangchunhao-slurm/.triton/cache/098aa9b899bc0244743654c666c2e82a/_sgmv_shrink_kernel.json.tmp.pid_104939_42450'
(VllmWorkerProcess pid=105101) ERROR 07-04 20:28:41 multiproc_worker_utils.py:226]
I have to delete the Triton cache manually before I can run it again (a cleanup sketch follows the second log below). And sometimes the failure is a decode error instead:
INFO 07-04 20:07:09 config.py:703] Defaulting to use mp for distributed inference
INFO 07-04 20:07:09 llm_engine.py:169] Initializing an LLM engine (v0.5.0.post1) with config: model='/data03/xxx_share/Qwen/Qwen2-72B-Instruct', speculative_config=None, tokenizer='/data03/xxx_share/Qwen/Qwen2-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data03/xxx_share/Qwen/Qwen2-72B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:12 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:13 utils.py:720] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:13 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87016) WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87017) WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=87018) WARNING 07-04 20:07:15 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-04 20:07:32 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87016) INFO 07-04 20:07:32 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87017) INFO 07-04 20:07:33 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87018) INFO 07-04 20:07:33 model_runner.py:254] Loading model weights took 33.9833 GB
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'utf-8' codec can't decode byte 0xbe in position 18: invalid start byte, Traceback (most recent call last):
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/worker.py", line 175, in determine_num_available_blocks
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     self.model_runner.profile_run()
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/model_runner.py", line 849, in profile_run
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/worker/model_runner.py", line 1215, in execute_model
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 336, in forward
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 257, in forward
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     hidden_states, residual = layer(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 209, in forward
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     hidden_states = self.self_attn(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/model_executor/models/qwen2.py", line 153, in forward
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     qkv, _ = self.qkv_proj(hidden_states)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 511, in forward
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     output_parallel = self.apply(input_, bias)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 918, in apply
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     _apply_lora_packed_nslice(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/layers.py", line 134, in _apply_lora_packed_nslice
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     add_lora(output,
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/punica.py", line 253, in add_lora
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     add_expand_slice(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/punica.py", line 159, in add_expand_slice
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     sgmv_expand_slice(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data05/project/LLaMA-Factory-20240624/vllm-refactor-punica-kernel/vllm/lora/ops/sgmv_expand_slice.py", line 178, in sgmv_expand_slice
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     _sgmv_expand_slice_kernel[grid](
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     self.cache[device][key] = compile(
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 202, in compile
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return CompiledKernel(so_path, metadata_group.get(metadata_filename))
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 230, in __init__
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     self.asm = {
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/site-packages/triton/compiler/compiler.py", line 231, in <dictcomp>
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.binary_ext else file.read_text()
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/pathlib.py", line 1135, in read_text
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     return f.read()
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]   File "/data/xxx/workspace/anaconda/envs/llama_factory/lib/python3.10/codecs.py", line 322, in decode
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]     (result, consumed) = self._buffer_decode(data, self.errors, final)
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 18: invalid start byte
(VllmWorkerProcess pid=87016) ERROR 07-04 20:07:55 multiproc_worker_utils.py:226]
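For reference, the manual cleanup mentioned above is just deleting the Triton JIT cache directory; a minimal sketch, assuming the default location ~/.triton/cache (Triton also honors TRITON_CACHE_DIR if it is set):

import os
import shutil

# Assumption: default Triton cache location, overridable via TRITON_CACHE_DIR.
cache_dir = os.environ.get("TRITON_CACHE_DIR",
                           os.path.expanduser("~/.triton/cache"))
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # forces a clean kernel rebuild on the next run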
When I run it with vLLM's API server (python -m vllm.entrypoints.openai.api_server --...), the LoRA adapter seems to have no effect; the output looks like the original base model. But the LoRA adapter works well when I run the script above.
This is a Triton bug; see https://github.com/vllm-project/vllm/issues/6103. For now, you can work around the error by setting distributed_executor_backend to "ray". For example:
llm = vllm.LLM(
    MODEL_PATH,
    enable_lora=True,
    max_num_seqs=16,
    max_loras=2,
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    tensor_parallel_size=4,
    distributed_executor_backend="ray",
)
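If you are going through the OpenAI-compatible server instead of the LLM class, the same engine argument should be exposed as a CLI flag (assuming your build includes it):

python -m vllm.entrypoints.openai.api_server \
    --model MODEL_PATH \
    --tensor-parallel-size 4 \
    --distributed-executor-backend ray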
Regarding the LoRA adapter having no effect on the API server: I will look into this issue tomorrow.
Setting distributed_executor_backend="ray" works for me.
I run the API server through LLaMA-Factory, and the adapter works well there.
This should be resolved by the newly landed Triton kernels: https://github.com/vllm-project/vllm/pull/5036
Your current environment
🐛 Describe the bug
CUDA_VISIBLE_DEVICES=4,5,6,7 python vllm_qwen2_lora.py