mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] large concurrency service broken #3005

Open fan-niu opened 3 weeks ago

fan-niu commented 3 weeks ago

🐛 Bug

A service started from Meta-Llama-3.1-70B-Instruct (fp8) crashes when run under high request concurrency.

To Reproduce

Convert model

Refer to this issue: #2982

Start service

mlc_llm serve Meta-Llama-3.1-70B-Instruct-fp8-e4m3_e4m3_f16 \
    --model-lib Meta-Llama-3.1-70B-Instruct-fp8-e4m3_e4m3_f16/libs/h100-e4m3_e4m3_f16-cuda.so \
    --mode server \
    --host 127.0.0.1 \
    --port 8081 \
    --device cuda \
    --prefix-cache-mode disable \
    --enable-debug \
    --overrides "tensor_parallel_shards=2"

Concurrency test

Service broken: 70 concurrent requests, with average input length 2310 tokens and average output length 50 tokens.
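The report does not include the load driver itself, so here is a minimal sketch of one, assuming the OpenAI-compatible `/v1/chat/completions` endpoint that `mlc_llm serve` exposes; the prompt content, model name, and token counts below are placeholders, not the actual test inputs:

```python
# Hedged sketch of a concurrency load driver (stdlib only).
# The endpoint path follows the OpenAI-compatible API served by `mlc_llm serve`;
# the model name, prompts, and max_tokens are illustrative placeholders.
import concurrent.futures
import json
import urllib.request


def send_request(base_url: str, prompt: str, max_tokens: int = 50) -> int:
    """POST one chat completion request and return the HTTP status code."""
    body = json.dumps({
        "model": "Meta-Llama-3.1-70B-Instruct-fp8-e4m3_e4m3_f16",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def run_concurrent(fn, n_concurrent: int):
    """Fire n_concurrent calls of fn in parallel and collect the results
    in submission order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        futures = [pool.submit(fn, i) for i in range(n_concurrent)]
        return [f.result() for f in futures]
```

To reproduce the reported load shape, one would call something like `run_concurrent(lambda i: send_request("http://127.0.0.1:8081", long_prompt(i)), 70)`, where each prompt averages roughly 2310 tokens.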

Error messages

==== backtrace (tid:   1147) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000003200e60 tvm::runtime::relax_vm::PagedAttentionKVCacheObj::GetTotalSequenceLength()  ???:0
 2 0x00000000031e5274 tvm::runtime::TypedPackedFunc<int (tvm::runtime::relax_vm::AttentionKVCache)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::relax_vm::AttentionKVCache, tvm::runtime::relax_vm::AttentionKVCacheObj, int, , void>(int (tvm::runtime::relax_vm::AttentionKVCacheObj::*)() const)::{lambda(tvm::runtime::relax_vm::AttentionKVCache)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::relax_vm::AttentionKVCache, tvm::runtime::relax_vm::AttentionKVCacheObj, int, , void>(int (tvm::runtime::relax_vm::AttentionKVCacheObj::*)() const)::{lambda(tvm::runtime::relax_vm::AttentionKVCache)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()()  ???:0
 3 0x00000000031e5325 tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<int (tvm::runtime::relax_vm::AttentionKVCache)>::AssignTypedLambda<tvm::runtime::Registry::set_body_method<tvm::runtime::relax_vm::AttentionKVCache, tvm::runtime::relax_vm::AttentionKVCacheObj, int, , void>(int (tvm::runtime::relax_vm::AttentionKVCacheObj::*)() const)::{lambda(tvm::runtime::relax_vm::AttentionKVCache)#1}>(tvm::runtime::Registry::set_body_method<tvm::runtime::relax_vm::AttentionKVCache, tvm::runtime::relax_vm::AttentionKVCacheObj, int, , void>(int (tvm::runtime::relax_vm::AttentionKVCacheObj::*)() const)::{lambda(tvm::runtime::relax_vm::AttentionKVCache)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call()  ???:0
 4 0x00000000002d597d tvm::runtime::PackedFuncObj::CallPacked()  /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1398
 5 0x00000000002d597d tvm::runtime::PackedFunc::operator()<tvm::runtime::ObjectRef const&>()  /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1932
 6 0x00000000002d597d mlc::llm::serve::ModelImpl::GetCurrentTotalSequenceLength()  /workspace/mlc-llm/cpp/serve/model.cc:775
 7 0x000000000027bdec mlc::llm::serve::BatchPrefillBaseActionObj::GetRequestStateEntriesToPrefill()  /workspace/mlc-llm/cpp/serve/engine_actions/batch_prefill_base.cc:91
 8 0x000000000027bdec tvm::runtime::ObjectPtr<tvm::runtime::Object>::~ObjectPtr()  /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/object.h:404
 9 0x000000000027bdec tvm::runtime::ObjectRef::~ObjectRef()  /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/object.h:519
10 0x000000000027bdec mlc::llm::serve::Model::~Model()  /workspace/mlc-llm/cpp/serve/engine_actions/../model.h:366
11 0x000000000027bdec mlc::llm::serve::BatchPrefillBaseActionObj::GetRequestStateEntriesToPrefill()  /workspace/mlc-llm/cpp/serve/engine_actions/batch_prefill_base.cc:91
12 0x000000000029d4ac mlc::llm::serve::NewRequestPrefillActionObj::Step()  /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:35
13 0x000000000029d4ac std::_Vector_base<mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput, std::allocator<mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput> >::_Vector_impl_data::_M_swap_data()  /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_vector.h:123
14 0x000000000029d4ac std::vector<mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput, std::allocator<mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput> >::_M_move_assign()  /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_vector.h:1818
15 0x000000000029d4ac std::vector<mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput, std::allocator<mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput> >::operator=()  /opt/rh/gcc-toolset-11/root/usr/include/c++/11/bits/stl_vector.h:714
16 0x000000000029d4ac mlc::llm::serve::NewRequestPrefillActionObj::Step()  /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:35
17 0x000000000024f9f9 mlc::llm::serve::EngineImpl::Step()  /workspace/mlc-llm/cpp/serve/engine.cc:594
18 0x000000000024f9f9 tvm::runtime::ObjectPtr<tvm::runtime::Object>::operator=()  /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/object.h:444
19 0x000000000024f9f9 tvm::runtime::Array<mlc::llm::serve::Request, void>::operator=()  /workspace/mlc-llm/3rdparty/tvm/include/tvm/runtime/container/array.h:362
20 0x000000000024f9f9 mlc::llm::serve::EngineImpl::Step()  /workspace/mlc-llm/cpp/serve/engine.cc:594
21 0x000000000030a093 mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()  /workspace/mlc-llm/cpp/serve/threaded_engine.cc:182
22 0x000000000030a093 std::atomic<bool>::load()  /opt/rh/gcc-toolset-11/root/usr/include/c++/11/atomic:112
23 0x000000000030a093 mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()  /workspace/mlc-llm/cpp/serve/threaded_engine.cc:138
24 0x000000000313ceba TVMFuncCall()  ???:0
25 0x0000000000022215 __pyx_f_3tvm_4_ffi_4_cy3_4core_FuncCall()  core.cpp:0
26 0x0000000000022af8 __pyx_pw_3tvm_4_ffi_4_cy3_4core_14PackedFuncBase_5__call__()  core.cpp:0
27 0x000000000016942b PyObject_Call()  ???:0
28 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
29 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
30 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
31 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
32 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
33 0x0000000000168a51 PyMethod_New()  ???:0
34 0x0000000000291f3a _PyDict_SetItem_KnownHash()  ???:0
35 0x0000000000286ef8 _PyObject_RealIsInstance()  ???:0
36 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
37 0x0000000000126850 __xmknodat()  ???:0
=================================
run_service_fp8_70b.sh: line 22:   889 Segmentation fault      (core dumped) mlc_llm serve $MLC_MODEL --model-lib $MLC_LIB --mode server --host $SERVER_ADDR --port $SERVER_PORT --device cuda --prefix-cache-mode disable --enable-debug --overrides "tensor_parallel_shards=2"

Expected behavior

Is there any way to keep the service from crashing other than lowering the concurrency? For example, requests that cannot be processed could be evicted, or the response rate could be throttled.
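Until the server-side crash is fixed, one possible client-side workaround (a sketch, not a fix for the underlying bug) is to bound in-flight requests with a semaphore so the engine never sees more than a chosen concurrency; the `limit` value and the wrapped function are illustrative:

```python
# Hedged sketch: cap the number of concurrent calls reaching the server
# with a semaphore, regardless of how many tasks the client submits.
import threading


def throttled(fn, limit: int):
    """Wrap fn so that at most `limit` invocations run concurrently;
    extra callers block until a slot frees up."""
    gate = threading.Semaphore(limit)

    def wrapper(*args, **kwargs):
        with gate:
            return fn(*args, **kwargs)

    return wrapper
```

A caller could then submit all 70 tasks to a thread pool but wrap the request function as `throttled(send_request_fn, 32)`, keeping the server-side concurrency at a level it can handle while the remaining requests queue on the client.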

Environment

Additional context

At low concurrency the service runs normally; at high concurrency it crashes. We hope the service can be made not to crash.

MasterJH5574 commented 3 weeks ago

Thanks for reporting. We will find time and try to reproduce it. Meanwhile, may I ask how often this segmentation fault happens? Does it happen every time you use “70 concurrency”?

fan-niu commented 2 weeks ago

@MasterJH5574 Yes, every time we use 70 concurrency, the service crashes. Thanks for your reply; looking forward to good news.

fan-niu commented 2 weeks ago

@MasterJH5574 Hi, is there any progress on this issue? Thanks.

MasterJH5574 commented 1 week ago

@fan-niu Sorry, we haven't had enough bandwidth to work on it yet. We will try our best to get to it as early as possible.