vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Nvidia-H20 with nvcr.io/nvidia/pytorch:23.12-py3, CUBLAS Error! #2798

Open tohneecao opened 7 months ago

tohneecao commented 7 months ago

```
INFO 02-07 11:14:13 llm_engine.py:70] Initializing an LLM engine with config: model='/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms', tokenizer='/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=True, seed=0)
INFO 02-07 11:14:18 llm_engine.py:275] # GPU blocks: 9200, # CPU blocks: 512
Wed, 07 Feb 2024 11:14:20 aiperf_inference.py[line:213] INFO LLM engine created
Processed prompts: 100%|███████████████████████████████████████▊| 999/1000 [02:20<00:00, 3.87it/s]
[a9970a74a52a:279 :0:279] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid: 279) ====
 0 0x0000000000042520 sigaction() ???:0
 1 0x0000000000a0bc59 cublasLt_for_cublas_ZZZ() ???:0
 2 0x0000000000814383 cublasLt_for_cublas_ZZZ() ???:0
 3 0x00000000006ace72 cublasLtLegacyGemmUtilizationZZZ() ???:0
 4 0x00000000007aa087 cublasLtMatmulAlgoCheck() ???:0
 5 0x00000000007ab055 cublasLtMatmulAlgoCheck() ???:0
 6 0x00000000007abd2e cublasLtMatmulAlgoCheck() ???:0
 7 0x00000000007bd046 cublasLtHSHMatmulAlgoGetHeuristic() ???:0
 8 0x000000000085d43a cublasXerbla() ???:0
 9 0x000000000085deec cublasXerbla() ???:0
10 0x0000000000860122 cublasXerbla() ???:0
11 0x00000000008432ef cublasXerbla() ???:0
12 0x0000000000ac7ecf cublasUint8gemmBias() ???:0
13 0x0000000000ac83d8 cublasUint8gemmBias() ???:0
14 0x00000000003e1c7d cublasGemmEx() ???:0
15 0x000000000301f011 at::cuda::blas::gemm() :0
16 0x00000000030493c8 at::native::(anonymous namespace)::addmm_out_cuda_impl() Blas.cpp:0
17 0x000000000304988a at::native::structured_mm_out_cuda::impl() ???:0
18 0x0000000002dcc2e0 at::(anonymous namespace)::wrapper_CUDA_mm() RegisterCUDA.cpp:0
19 0x0000000002dcc350 c10::impl::wrap_kernel_functorunboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call() RegisterCUDA.cpp:0
20 0x0000000002782b11 at::_ops::mm::call() ???:0
21 0x0000000001b910d5 at::native::_matmul_impl() LinearAlgebra.cpp:0
22 0x0000000001b98729 at::native::matmul() ???:0
23 0x0000000002d059c0 c10::impl::wrap_kernel_functorunboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutogradmatmul>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call() RegisterCompositeImplicitAutograd.cpp:0
24 0x00000000028a4051 at::_ops::matmul::call() ???:0
25 0x0000000001b7fa33 at::native::linear() ???:0
26 0x0000000002d05753 c10::impl::wrap_kernel_functorunboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutogradlinear>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional const&> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional const&)>::call() RegisterCompositeImplicitAutograd.cpp:0
27 0x00000000022fed9f at::_ops::linear::call() ???:0
28 0x000000000067775a torch::autograd::THPVariable_linear() python_nn_functions.cpp:0
29 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0
30 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
31 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
32 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
33 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
34 0x000000000016893e PyMethod_New() ???:0
35 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
36 0x000000000016893e PyMethod_New() ???:0
37 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
38 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
39 0x000000000016586c _PyObject_Call_Prepend() ???:0
40 0x0000000000280700 PyInitdatetime() ???:0
41 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
42 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
43 0x000000000016893e PyMethod_New() ???:0
44 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
45 0x000000000016893e PyMethod_New() ???:0
46 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
47 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
48 0x000000000016586c _PyObject_Call_Prepend() ???:0
49 0x0000000000280700 PyInit__datetime() ???:0
50 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
51 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
52 0x000000000016893e PyMethod_New() ???:0
53 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
54 0x000000000016893e PyMethod_New() ???:0
55 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
56 0x000000000014fc14 _PyObject_Fast
```
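For anyone trying to reproduce this outside the reporter's harness (aiperf_inference.py is not part of vLLM), here is a minimal Python sketch matching the engine configuration shown in the log above. The model path comes from the log; the prompts and sampling parameters are assumptions.

```python
# Minimal reproduction sketch based on the engine config in the log above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms",
    trust_remote_code=True,
    dtype="float16",
    max_model_len=4096,
    tensor_parallel_size=1,
    enforce_eager=True,  # the log shows enforce_eager=True was already set
)

# Hypothetical prompts; the original run processed ~1000 ShareGPT-style prompts.
prompts = ["Hello, how are you?"] * 8
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)
```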

The docker info:

[screenshot of the Docker environment info attached to the original issue]

Hap-Zhang commented 6 months ago

@tohneecao Hi, tohneecao. Have you solved this problem? I ran into the same problem on an Nvidia H20 machine.

yiakwy-xpu-ml-framework-team commented 4 months ago

@Hap-Zhang It might be related to the recently added chunked prefill feature.

Please use "--enforce-eager" mode; vLLM's graph compilation path is broken here. With this flag on, you should expect about 4600 tok/s on a single H20 card.

python benchmark.py --model "Llama-2-7b-chat-hf" --dataset /workspace/tests_vllm/ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --enforce-eager --kv-cache-dtype auto --dtype half --max-model-len 2048 --download-dir /workspace/tests_vllm/model
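For completeness, the same workaround when the engine is constructed from Python rather than through a benchmark script (a sketch; the model identifier is an assumption):

```python
# Force eager execution (skip CUDA graph capture) via the Python API.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model name
    dtype="half",
    max_model_len=2048,
    enforce_eager=True,  # equivalent to passing --enforce-eager on the CLI
)
```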

Note: flash-attn should be >= v2.3.
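A quick way to confirm the installed flash-attn version meets that bound (assuming flash_attn and packaging are importable in the serving environment):

```python
# Check that the installed flash-attn is at least 2.3.
import flash_attn
from packaging import version

installed = version.parse(flash_attn.__version__)
assert installed >= version.parse("2.3"), f"flash-attn {installed} < 2.3"
print(f"flash-attn {installed} is new enough")
```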

The benchmark data published by the vLLM team is incorrect.

umechand-amd commented 6 hours ago

I am facing the same issue, and it was not resolved after adding --enforce-eager to the command. My flash-attn version is 2.4.2.