Open tohneecao opened 9 months ago
@tohneecao Hi, tohneecao. Have you solved this problem? I ran into the same problem on an NVIDIA H20 machine.
@Hap-Zhang It might be related to the recently added chunked prefill feature.
Please use "--enforce-eager" mode; vLLM's CUDA graph compilation is broken. With it enabled, you should expect around 4600 tok/s on a single H20 card.
python benchmark_throughput.py --model /workspace/tests_vllm/Llama-2-7b-chat-hf -tp 1 --enforce-eager --dataset /workspace/tests_vllm/ShareGPT_V3_unfiltered_cleaned_split.json --kv-cache-dtype auto --dtype half --max-model-len 2048
Note: flash-attn should be >= v2.3.
The benchmark data from the vLLM team is incorrect.
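For anyone driving vLLM from Python rather than through benchmark_throughput.py, here is a minimal sketch of an equivalent offline run with eager mode forced, plus a quick flash-attn version check. This is only an illustration, not part of the original report; the model path is a placeholder.

```python
# Sketch only: check the installed flash-attn version and force eager mode
# (the Python-API equivalent of passing --enforce-eager to the benchmark script).
import flash_attn
from vllm import LLM, SamplingParams

print("flash-attn version:", flash_attn.__version__)  # should be >= 2.3

llm = LLM(
    model="/workspace/tests_vllm/Llama-2-7b-chat-hf",  # placeholder path
    tensor_parallel_size=1,
    dtype="half",
    kv_cache_dtype="auto",
    max_model_len=2048,
    enforce_eager=True,  # skip CUDA graph capture
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, max_tokens=128))
print(outputs[0].outputs[0].text)
```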
I am facing the same issue, and it was not resolved after adding --enforce-eager to the command. My flash-attn version is 2.4.2.
@umechand-amd vLLM updates very quickly; I will check this on H20 again. Thank you for reporting this issue. By the way, are you working on MI30X machines?
INFO 02-07 11:14:13 llm_engine.py:70] Initializing an LLM engine with config: model='/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms', tokenizer='/root/local_model_root/model/llama-2-7b/modelscope/Llama-2-7b-chat-ms', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=True, seed=0)
INFO 02-07 11:14:18 llm_engine.py:275] # GPU blocks: 9200, # CPU blocks: 512
Wed, 07 Feb 2024 11:14:20 aiperf_inference.py[line:213] INFO LLM engine created
Processed prompts: 100%|...| 999/1000 [02:20<00:00, 3.87it/s]
[a9970a74a52a:279 :0:279] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid: 279) ====
0 0x0000000000042520 sigaction() ???:0
1 0x0000000000a0bc59 cublasLt_for_cublas_ZZZ() ???:0
2 0x0000000000814383 cublasLt_for_cublas_ZZZ() ???:0
3 0x00000000006ace72 cublasLtLegacyGemmUtilizationZZZ() ???:0
4 0x00000000007aa087 cublasLtMatmulAlgoCheck() ???:0
5 0x00000000007ab055 cublasLtMatmulAlgoCheck() ???:0
6 0x00000000007abd2e cublasLtMatmulAlgoCheck() ???:0
7 0x00000000007bd046 cublasLtHSHMatmulAlgoGetHeuristic() ???:0
8 0x000000000085d43a cublasXerbla() ???:0
9 0x000000000085deec cublasXerbla() ???:0
10 0x0000000000860122 cublasXerbla() ???:0
11 0x00000000008432ef cublasXerbla() ???:0
12 0x0000000000ac7ecf cublasUint8gemmBias() ???:0
13 0x0000000000ac83d8 cublasUint8gemmBias() ???:0
14 0x00000000003e1c7d cublasGemmEx() ???:0
15 0x000000000301f011 at::cuda::blas::gemm() :0
16 0x00000000030493c8 at::native::(anonymous namespace)::addmm_out_cuda_impl() Blas.cpp:0
17 0x000000000304988a at::native::structured_mm_out_cuda::impl() ???:0
18 0x0000000002dcc2e0 at::(anonymous namespace)::wrapper_CUDA_mm() RegisterCUDA.cpp:0
19 0x0000000002dcc350 c10::impl::wrap_kernel_functorunboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call() RegisterCUDA.cpp:0
20 0x0000000002782b11 at::_ops::mm::call() ???:0
21 0x0000000001b910d5 at::native::_matmul_impl() LinearAlgebra.cpp:0
22 0x0000000001b98729 at::native::matmul() ???:0
23 0x0000000002d059c0 c10::impl::wrap_kernel_functorunboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd matmul>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call() RegisterCompositeImplicitAutograd.cpp:0
24 0x00000000028a4051 at::_ops::matmul::call() ???:0
25 0x0000000001b7fa33 at::native::linear() ???:0
26 0x0000000002d05753 c10::impl::wrap_kernel_functorunboxed<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutogradlinear>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::optional const&> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::optional const&)>::call() RegisterCompositeImplicitAutograd.cpp:0
27 0x00000000022fed9f at::_ops::linear::call() ???:0
28 0x000000000067775a torch::autograd::THPVariable_linear() python_nn_functions.cpp:0
29 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0
30 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
31 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
32 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
33 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
34 0x000000000016893e PyMethod_New() ???:0
35 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
36 0x000000000016893e PyMethod_New() ???:0
37 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
38 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
39 0x000000000016586c _PyObject_Call_Prepend() ???:0
40 0x0000000000280700 PyInit__datetime() ???:0
41 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
42 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
43 0x000000000016893e PyMethod_New() ???:0
44 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
45 0x000000000016893e PyMethod_New() ???:0
46 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
47 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
48 0x000000000016586c _PyObject_Call_Prepend() ???:0
49 0x0000000000280700 PyInit__datetime() ???:0
50 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
51 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
52 0x000000000016893e PyMethod_New() ???:0
53 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
54 0x000000000016893e PyMethod_New() ???:0
55 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
56 0x000000000014fc14 _PyObject_Fast
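In case it helps localize the crash, a small sketch (my suggestion, not part of the original report) for getting a Python-level traceback alongside this native backtrace when the process dies on SIGFPE:

```python
# Sketch: faulthandler dumps the Python stack on fatal signals
# (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), which shows where in the
# benchmark the offending matmul was issued.
import faulthandler
faulthandler.enable()

# ... then run the usual benchmark / generation code in the same process ...
```

The same effect is available without editing the script by running it as `python -X faulthandler benchmark_throughput.py ...`.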
The docker info: