Open orderer0001 opened 2 weeks ago
After running export VLLM_ATTENTION_BACKEND=FLASHINFER and adding --disable-sliding-window, I got the following error:
ERROR 07-06 05:27:47 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'NoneType' object is not callable, Traceback (most recent call last)
vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
@orderer0001 You need to install FlashInfer manually.
However, it currently works with only one GPU.
Update: I got Segmentation fault (core dumped) after serving about 50 requests.
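The 'NoneType' object is not callable error reported at the top of this thread is consistent with the FlashInfer package being absent from the environment, although the thread does not confirm the exact mechanism. A minimal sketch for checking that the package is importable before selecting the backend; the helper name is hypothetical and not part of vLLM:

```python
import importlib.util
import os


def select_flashinfer_backend(env=os.environ):
    """Set VLLM_ATTENTION_BACKEND=FLASHINFER only if flashinfer is importable.

    Hypothetical helper, not part of vLLM. Returns True when the variable
    was set, False when the flashinfer package could not be found.
    """
    if importlib.util.find_spec("flashinfer") is None:
        return False
    env["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
    return True


if __name__ == "__main__":
    if select_flashinfer_backend():
        print("FlashInfer found; backend variable set.")
    else:
        print("FlashInfer is not installed; install it before using this backend.")
```

This only verifies that the package can be located; it says nothing about whether the installed wheel matches the local CUDA and PyTorch versions.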
@zjc17 do you know if and when FlashInfer will support more than one GPU?
I also got a segmentation fault with flashinfer 0.0.8 after some requests.
I haven't done much research on the framework itself. My guess is that it is simply a replacement for the flash attention backend in this scenario, so the parallelism comes from the Ray framework itself, which needs more compatibility testing.
The final explanation is left to the maintainer team.
I encountered a similar issue to yours, namely "Segmentation fault (core dumped)," but this problem appeared at the 3708th text out of 3822 texts that need to be inferred, and at the 12653rd text out of 12740 texts that need to be inferred. I look forward to this issue being resolved.
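When a crash shows up only as Segmentation fault (core dumped), Python's standard faulthandler module can at least print which Python frame was active when the signal arrived. A minimal sketch, assuming it is called at the start of the inference script; the function name is illustrative:

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=sys.stderr):
    """Dump Python tracebacks for all threads on fatal signals such as SIGSEGV.

    The stream must be a real file object (it needs a file descriptor) and
    must stay open for as long as the handler is enabled.
    """
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # ... run the inference loop here ...
```

The same effect is available without code changes by setting PYTHONFAULTHANDLER=1 or running python -X faulthandler. The dump shows Python frames only; locating the fault inside a native kernel still requires the core dump or a debugger.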
I also experienced Segmentation fault (core dumped), but the situation is slightly different.
I am using the offline mode similar to the code posted here https://github.com/vllm-project/vllm/pull/5908#issuecomment-2216195148.
When the number of prompts is larger than 20, none of the prompts seem to be processed and I directly get Segmentation fault (core dumped).
Processed prompts: 0%| | 0/20 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s
evaluation_scripts/eval_vllm.sh: line 39: 1411590 Segmentation fault (core dumped)
If number of prompts is less than 20, everything works fine.
I am unable to try the approach mentioned by @LiuXiaoxuanPKU here https://github.com/vllm-project/vllm/issues/6252#issuecomment-2223512720 since I cannot compile FlashInfer from source. (I have an H100 and sm90 is supported, but the compiler complains RuntimeError: FlashInfer requires sm75+.)
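Since runs with fewer than 20 prompts reportedly work, one stopgap while the crash is investigated is to feed the prompts in smaller batches. This is a sketch of a workaround, not a fix; the helper names and the batch size of 16 are assumptions, and generate stands for any callable that maps a list of prompts to a list of outputs:

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_batches(
    prompts: Sequence[str],
    generate: Callable[[Sequence[str]], List],
    batch_size: int = 16,  # assumed value, chosen only to stay below 20
) -> List:
    """Call `generate` on each batch and concatenate the results in order."""
    outputs: List = []
    for batch in chunked(prompts, batch_size):
        outputs.extend(generate(batch))
    return outputs
```

With vLLM's offline API this could be used as generate_in_batches(prompts, lambda batch: llm.generate(batch, sampling_params)), where llm and sampling_params are the objects already created in the script.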
@Yutong-Dai Weird, FlashInfer should support H100 (I tested locally with H100 and A100). Could you run the following command and share its output?
nvidia-smi --query-gpu=compute_cap --format=csv
It should output 9.0 on an H100.
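The same check can be scripted. The sketch below runs the command quoted above and compares each reported capability with the sm75 minimum named in the build error; the function names are illustrative, and the sample string is made-up input for demonstration only:

```python
import subprocess


def parse_compute_caps(csv_text):
    """Parse `nvidia-smi --query-gpu=compute_cap --format=csv` output
    into a list of (major, minor) tuples, one per GPU."""
    caps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line or line == "compute_cap":  # skip blanks and the CSV header
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def query_compute_caps():
    """Run nvidia-smi and return the parsed capabilities (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv"],
        check=True, capture_output=True, text=True,
    )
    return parse_compute_caps(result.stdout)


if __name__ == "__main__":
    sample = "compute_cap\n9.0\n9.0\n"  # illustrative input, not real output
    caps = parse_compute_caps(sample)
    print(caps)                                   # [(9, 0), (9, 0)]
    print(all(cap >= (7, 5) for cap in caps))     # True
```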
Hi @LiuXiaoxuanPKU, thanks for your timely reply. Upon running nvidia-smi --query-gpu=compute_cap --format=csv, I got
compute_cap
9.0
9.0
9.0
9.0
9.0
9.0
9.0
9.0
More details:
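import torch.utils.cpp_extension as torch_cpp_ext  # alias assumed; the original snippet omitted the import

for cuda_arch_flags in torch_cpp_ext._get_cuda_arch_flags():
    print(cuda_arch_flags)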
I got
-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_60,code=sm_60
-gencode=arch=compute_61,code=sm_61
-gencode=arch=compute_70,code=sm_70
-gencode=arch=compute_72,code=sm_72
-gencode=arch=compute_75,code=sm_75
-gencode=arch=compute_80,code=sm_80
-gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_87,code=sm_87
-gencode=arch=compute_90,code=compute_90
-gencode=arch=compute_90,code=sm_90
my env
docker image: nvcr.io/nvidia/pytorch:24.01-py3
torch version: '2.3.0+cu121'
cuda: Cuda compilation tools, release 12.3, V12.3.107; Build cuda_12.3.r12.3/compiler.33567101_0
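The flag list above includes architectures below sm75 (compute_52 through compute_72) even though the GPUs themselves report 9.0. If the FlashInfer build check inspects every generated flag and not the physical device, which is an assumption and is not confirmed in this thread, those entries alone could trigger RuntimeError: FlashInfer requires sm75+. A small sketch that lists the entries below a given minimum, using the flags printed above as input:

```python
import re


def archs_from_flags(flags):
    """Extract numeric architectures from -gencode flags such as
    '-gencode=arch=compute_90,code=sm_90'."""
    archs = []
    for flag in flags:
        match = re.search(r"arch=compute_(\d+)", flag)
        if match:
            archs.append(int(match.group(1)))
    return archs


def archs_below(flags, minimum=75):
    """Return the sorted, de-duplicated architectures lower than `minimum`."""
    return sorted({arch for arch in archs_from_flags(flags) if arch < minimum})


if __name__ == "__main__":
    flags = [
        "-gencode=arch=compute_52,code=sm_52",
        "-gencode=arch=compute_60,code=sm_60",
        "-gencode=arch=compute_61,code=sm_61",
        "-gencode=arch=compute_70,code=sm_70",
        "-gencode=arch=compute_72,code=sm_72",
        "-gencode=arch=compute_75,code=sm_75",
        "-gencode=arch=compute_80,code=sm_80",
        "-gencode=arch=compute_86,code=sm_86",
        "-gencode=arch=compute_87,code=sm_87",
        "-gencode=arch=compute_90,code=compute_90",
        "-gencode=arch=compute_90,code=sm_90",
    ]
    print(archs_below(flags))  # [52, 60, 61, 70, 72]
```

If this turns out to be the cause, restricting the build targets through PyTorch's TORCH_CUDA_ARCH_LIST environment variable (for example TORCH_CUDA_ARCH_LIST=9.0) before compiling may help; this has not been tested here.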
I'm using an A100; here is the output:
compute_cap
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
Please just install flashinfer 0.0.9 (https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.9) and report bugs if any. Thanks!
Thanks @LiuXiaoxuanPKU! flashinfer 0.0.9 fixes the Segmentation fault (core dumped), and everything works in the single-GPU case.
However, if I set tensor_parallel_size to, say, 8, then I get
vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
and
[rank0]: RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor
I've tried deleting all files in /tmp. The error still persists.
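For inotify_add_watch, "No space left on device" (ENOSPC) generally means the per-user limit on inotify watches was reached, not that the disk is full, which would explain why clearing /tmp had no effect. Whether this UCX message is related to the workspace memory error is not established in this thread. A minimal sketch for reading the current limits on Linux; the function name is illustrative:

```python
from pathlib import Path

INOTIFY_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limits(base=INOTIFY_DIR):
    """Return the kernel's inotify limits as a dict of name -> int.

    A value of None means the file was missing or unreadable
    (for example on a non-Linux system).
    """
    limits = {}
    for name in ("max_user_watches", "max_user_instances", "max_queued_events"):
        path = Path(base) / name
        try:
            limits[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"{name}: {value}")
```

Raising fs.inotify.max_user_watches with sysctl requires root on the host; inside a container the value is usually inherited from the host.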
Detailed log
[rank0]: Traceback (most recent call last):
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 149, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 414, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]: self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1221, in execute_model
[rank0]: model_input.attn_metadata.begin_forward()
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/attention/backends/flashinfer.py", line 132, in begin_forward
[rank0]: self.prefill_wrapper.begin_forward(
[rank0]: File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/flashinfer/prefill.py", line 778, in begin_forward
[rank0]: self._wrapper.begin_forward(
[rank0]: RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor
INFO 07-13 04:00:56 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
I also get RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor, but only when using tensor parallelism.
Your current environment
Set up according to the prompts:
import os
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'
print("Environment variable set for VLLM_ATTENTION_BACKEND:", os.getenv('VLLM_ATTENTION_BACKEND'))
🐛 Describe the bug
When running gemma2 7b, an error is reported:
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)