mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] multiple gpu seting: Check failed num_running_rsentries <= engine_config_->max_num_sequence (81 vs. 80) : #2575

Open aaronlyt opened 3 weeks ago

aaronlyt commented 3 weeks ago

Setting

- server command: `mlc_llm serve mlc-llama2-7b-q4 --overrides "tensor_parallel_shards=2" --mode server`
- request rate: 20 requests/s
- GPU: A40

❓ General Questions

A rate of 10 requests/s works fine, but at 20 requests/s I hit the error below. Can I change some parameter settings so that the server handles 20 requests/s?

tvm.error.InternalError: Traceback (most recent call last):
  4: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
        at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:182
  3: mlc::llm::serve::EngineImpl::Step()
        at /workspace/mlc-llm/cpp/serve/engine.cc:619
  2: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:35
  1: mlc::llm::serve::BatchPrefillBaseActionObj::GetRequestStateEntriesToPrefill(mlc::llm::serve::EngineState)
        at /workspace/mlc-llm/cpp/serve/engine_actions/batch_prefill_base.cc:107
  0: mlc::llm::serve::BatchPrefillBaseActionObj::CanPrefill(mlc::llm::serve::EngineState, int, int, int, int, int, int, mlc::llm::KVStateKind, bool)
        at /workspace/mlc-llm/cpp/serve/engine_actions/batch_prefill_base.cc:216
  File "/workspace/mlc-llm/cpp/serve/engine_actions/batch_prefill_base.cc", line 216
InternalError: Check failed: num_running_rsentries <= engine_config_->max_num_sequence (81 vs. 80) :
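One setting that may be worth trying (an assumption on my part, not a confirmed fix): the `81 vs. 80` comparison suggests the number of running request-state entries exceeded the engine's `max_num_sequence` cap, which apparently resolved to 80 here. If this MLC LLM version accepts `max_num_sequence` in the `--overrides` string alongside `tensor_parallel_shards`, raising it might accommodate the 20 requests/s load. A sketch, with an illustrative value:

```shell
# Hypothetical workaround: raise the concurrent-sequence cap.
# Assumes max_num_sequence is a valid --overrides key in this build;
# 128 is an illustrative guess, not a tuned value.
mlc_llm serve mlc-llama2-7b-q4 \
  --mode server \
  --overrides "tensor_parallel_shards=2;max_num_sequence=128"
```

Note that a larger cap increases KV cache pressure, so the feasible value depends on the A40's memory budget; regardless of tuning, the engine arguably should throttle admissions rather than fail this internal check.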

MasterJH5574 commented 1 week ago

Thank you @aaronlyt for reporting. We will look into this issue.