vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: When running gemma2 7b, an error is reported [rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)` Set up according to the prompts: os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER' print("Environment variable set for VLLM_ATTENTION_BACKEND:", os.getenv('VLLM_ATTENTION_BACKEND')) #6166

Open orderer0001 opened 2 weeks ago

orderer0001 commented 2 weeks ago

Your current environment

(Same error and setup as in the bug description below.)

🐛 Describe the bug

When running gemma2 7b, the following error is reported:

[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

I set things up according to the prompt:

os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'
print("Environment variable set for VLLM_ATTENTION_BACKEND:", os.getenv('VLLM_ATTENTION_BACKEND'))
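Roughly, the setup looks like this (the model id below is a placeholder, not my exact path):

import os
# The attention backend must be chosen before the engine is created.
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'
print("Environment variable set for VLLM_ATTENTION_BACKEND:", os.getenv('VLLM_ATTENTION_BACKEND'))

from vllm import LLM, SamplingParams

# Placeholder model id; substitute the actual gemma2 checkpoint being used.
llm = LLM(model="google/gemma-2-9b-it", dtype="bfloat16")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)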

zjc17 commented 2 weeks ago

After running export VLLM_ATTENTION_BACKEND=FLASHINFER and adding --disable-sliding-window, I got the following error:

ERROR 07-06 05:27:47 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: 'NoneType' object is not callable, Traceback (most recent call last)

vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
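(For context, the server was launched roughly like this; the model path is a placeholder, not my exact command:)

export VLLM_ATTENTION_BACKEND=FLASHINFER
# /path/to/gemma-2 is a placeholder for the actual checkpoint directory.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/gemma-2 \
    --disable-sliding-window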
zjc17 commented 2 weeks ago

You need to install FlashInfer manually, @orderer0001.

But currently it can only use one GPU.

Update: I got Segmentation fault (core dumped) after inferencing about 50 requests.
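(A typical manual install is via the FlashInfer wheel index; the index URL below assumes CUDA 12.1 and torch 2.3 wheels, so adjust it for your environment:)

# Wheel index assumed to follow https://flashinfer.ai/whl/<cuda>/<torch>/ -- pick the one matching your setup.
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/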

lucafirefox commented 2 weeks ago

@zjc17 do you know if and when FlashInfer will support more than one GPU?

KwanWaiChung commented 2 weeks ago

I also got a segmentation fault with flashinfer==0.0.8 after some requests.

You need to install FlashInfer manually, @orderer0001.

But currently it can only use one GPU.

Update: I got Segmentation fault (core dumped) after inferencing about 50 requests.

zjc17 commented 2 weeks ago

@zjc17 do you know if and when FlashInfer will support more than one GPU?

I haven't done much research on the framework itself. My guess is that in this scenario it simply replaces the FlashAttention backend, so the parallelism comes from the Ray framework itself, which needs more compatibility testing.

I'll leave the final explanation to the maintainer team.

Hi-archers commented 1 week ago

I also got a segmentation fault with flashinfer==0.0.8 after some requests.

You need to install FlashInfer manually, @orderer0001. But currently it can only use one GPU. Update: I got Segmentation fault (core dumped) after inferencing about 50 requests.

I encountered a similar issue to yours, namely "Segmentation fault (core dumped)", but in my case it appeared at the 3708th of 3822 texts to be inferred in one run, and at the 12653rd of 12740 in another. I look forward to this issue being resolved.

Yutong-Dai commented 1 week ago

I also experienced Segmentation fault (core dumped), but the situation is slightly different.

I am using offline mode, similar to the code posted here: https://github.com/vllm-project/vllm/pull/5908#issuecomment-2216195148.

When the number of prompts is larger than 20, none of the prompts seem to be processed and I directly get Segmentation fault (core dumped).

Processed prompts:   0%|       | 0/20 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s
evaluation_scripts/eval_vllm.sh: line 39: 1411590 Segmentation fault      (core dumped)

If the number of prompts is less than 20, everything works fine.
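Roughly, the offline call looks like this (model id and prompts are placeholders, not my exact eval script):

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Placeholder model id; the real script loads a local checkpoint.
llm = LLM(model="google/gemma-2-9b-it")
prompts = [f"Question {i}: ..." for i in range(32)]  # more than 20 prompts triggers the crash for me
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))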

I am unable to try the approach mentioned by @LiuXiaoxuanPKU here https://github.com/vllm-project/vllm/issues/6252#issuecomment-2223512720 since I cannot compile FlashInfer from source. (I have H100s, which are sm90, but the compiler complains: RuntimeError: FlashInfer requires sm75+.)

LiuXiaoxuanPKU commented 1 week ago

@Yutong-Dai weird, FlashInfer should support H100 (I tested locally with H100 & A100). Could you run the following command and see what it outputs?

nvidia-smi --query-gpu=compute_cap --format=csv

It should output 9.0 if it's an H100.

Yutong-Dai commented 1 week ago

@Yutong-Dai weird, FlashInfer should support H100 (I tested locally with H100 & A100). Could you run the following command and see what it outputs?

nvidia-smi --query-gpu=compute_cap --format=csv

It should output 9.0 if it's an H100.

Hi @LiuXiaoxuanPKU, thanks for your timely reply. Upon using nvidia-smi --query-gpu=compute_cap --format=csv, I got

compute_cap
9.0
9.0
9.0
9.0
9.0
9.0
9.0
9.0
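Cross-checking from Python shows the same thing (a quick sanity check, not from my actual eval script):

import torch
# Prints the (major, minor) compute capability of each visible GPU; (9, 0) corresponds to H100.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_capability(i))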

More details:

from torch.utils import cpp_extension as torch_cpp_ext

# Print the CUDA arch flags that the FlashInfer source build would use.
for cuda_arch_flags in torch_cpp_ext._get_cuda_arch_flags():
    print(cuda_arch_flags)

I got

-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_60,code=sm_60
-gencode=arch=compute_61,code=sm_61
-gencode=arch=compute_70,code=sm_70
-gencode=arch=compute_72,code=sm_72
-gencode=arch=compute_75,code=sm_75
-gencode=arch=compute_80,code=sm_80
-gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_87,code=sm_87
-gencode=arch=compute_90,code=compute_90
-gencode=arch=compute_90,code=sm_90

My env:

docker image: nvcr.io/nvidia/pytorch:24.01-py3
torch version: '2.3.0+cu121'
cuda: Cuda compilation tools, release 12.3, V12.3.107; Build cuda_12.3.r12.3/compiler.33567101_0
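(Speculative aside: since the list above includes architectures below sm75, restricting PyTorch's arch flags via TORCH_CUDA_ARCH_LIST, which _get_cuda_arch_flags honors, might be a way to retry the source build; I have not verified this.)

# Limit the generated arch flags to Hopper before re-running the FlashInfer source build.
export TORCH_CUDA_ARCH_LIST="9.0"
python -c "from torch.utils import cpp_extension; print(cpp_extension._get_cuda_arch_flags())"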
zjc17 commented 1 week ago

I'm using A100s; here is the output:

compute_cap
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
LiuXiaoxuanPKU commented 1 week ago

Please just install flashinfer 0.0.9 (https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.9) and report bugs if any. Thanks!
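(For example, install the wheel from that release that matches your Python, torch, and CUDA versions; the filename below is illustrative only, so check the release page for the exact asset name:)

# Illustrative wheel name -- substitute the asset matching your Python/torch/CUDA from the v0.0.9 release page.
pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.9/flashinfer-0.0.9+cu121torch2.3-cp39-cp39-linux_x86_64.whl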

Yutong-Dai commented 1 week ago

Please just install flashinfer 0.0.9 (https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.9) and report bugs if any. Thanks!

Thanks @LiuXiaoxuanPKU! flashinfer 0.0.9 fixes the Segmentation fault (core dumped), and everything works in the single-GPU case.

However, if I set tensor_parallel_size to, say, 8, then I got

vfs_fuse.c:281  UCX  ERROR inotify_add_watch(/tmp) failed: No space left on device

and

[rank0]: RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor

I've tried deleting all files in /tmp, but the error still persists.
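For reference, the tensor-parallel run is set up roughly like this (model id is a placeholder, not my exact script):

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Placeholder model id; tensor_parallel_size=8 is the setting that triggers the errors above.
llm = LLM(model="google/gemma-2-9b-it", tensor_parallel_size=8, dtype="bfloat16")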

Detailed log

[rank0]: Traceback (most recent call last):
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 149, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 414, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1221, in execute_model
[rank0]:     model_input.attn_metadata.begin_forward()
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/vllm/attention/backends/flashinfer.py", line 132, in begin_forward
[rank0]:     self.prefill_wrapper.begin_forward(
[rank0]:   File "<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/site-packages/flashinfer/prefill.py", line 778, in begin_forward
[rank0]:     self._wrapper.begin_forward(
[rank0]: RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor
INFO 07-13 04:00:56 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
<my-path>miniconda3/envs/vllm_0.5.1/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
noamgat commented 4 days ago

I also get

RuntimeError: RuntimeError: Out of workspace memory in AlignedAlloactor

only when using tensor parallelism.