🐛 Bug

Serving mistral-large-instruct-2407-q4f16_1 (compiled with "tensor_parallel_shards": 4) on ROCm works for a while, then the engine's background thread dies with "InternalError: Check failed: nwrite != -1 (-1 vs. -1) : Write Error: Broken pipe". Full log and traceback below.

To Reproduce

Start the server and wait for a period of time.

(mlcllm) a@aserver:~$ mlc_llm serve llm/mistral-large-instruct-2407-q4f16_1 --host 192.168.1.4
[2024-08-25 13:59:31] INFO auto_device.py:88: Not found device: cuda:0
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:0
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:1
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:2
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:3
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:4
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:5
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:6
[2024-08-25 13:59:33] INFO auto_device.py:79: Found device: rocm:7
[2024-08-25 13:59:34] INFO auto_device.py:88: Not found device: metal:0
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:0
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:1
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:2
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:3
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:4
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:5
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:6
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:7
[2024-08-25 13:59:36] INFO auto_device.py:79: Found device: vulkan:8
[2024-08-25 13:59:38] INFO auto_device.py:88: Not found device: opencl:0
[2024-08-25 13:59:38] INFO auto_device.py:35: Using device: rocm:0
[2024-08-25 13:59:38] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-08-25 13:59:38] INFO jit.py:158: Using cached model lib: /home/a/.cache/mlc_llm/model_lib/cfead2d711f56e44c7fd0fa68bddd3bd.so
[2024-08-25 13:59:38] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-08-25 13:59:38] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-08-25 13:59:38] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[13:59:41] /workspace/mlc-llm/cpp/serve/config.cc:687: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[13:59:41] /workspace/mlc-llm/cpp/serve/config.cc:687: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 41729, prefill chunk size will be set to 2048.
[13:59:41] /workspace/mlc-llm/cpp/serve/config.cc:687: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 41260, prefill chunk size will be set to 2048.
[13:59:41] /workspace/mlc-llm/cpp/serve/config.cc:768: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048.
[13:59:41] /workspace/mlc-llm/cpp/serve/config.cc:773: Estimated total single GPU memory usage: 17995.347 MB (Parameters: 16771.148 MB. KVCache: 778.401 MB. Temporary buffer: 445.798 MB). The actual usage might be slightly larger than the estimated number.
[13:59:41] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #0] Loading model to device: rocm:0
[13:59:41] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #1] Loading model to device: rocm:1
[13:59:41] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #2] Loading model to device: rocm:2
[13:59:41] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #3] Loading model to device: rocm:3
[13:59:41] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:175: Loading parameters...
[==================================================================================================>] [885/885]
[14:01:06] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:203: Loading done. Time used: Loading 76.568 s Preprocessing 8.240 s.
INFO: Started server process [15112]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 192.168.1.9:55521 - "OPTIONS /v1/chat/completions HTTP/1.1" 200 OK
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/a/miniconda3/envs/mlcllm/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "/workspace/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
File "/workspace/mlc-llm/cpp/serve/engine.cc", line 650, in mlc::llm::serve::EngineImpl::Step()
File "/workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc", line 45, in mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
File "/workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc", line 301, in mlc::llm::serve::NewRequestPrefillActionObj::MatchPrefixCache(mlc::llm::serve::EngineState, mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput*)
File "/workspace/mlc-llm/cpp/serve/model.cc", line 642, in mlc::llm::serve::ModelImpl::AddNewSequence(long)
File "/workspace/mlc-llm/cpp/serve/function_table.cc", line 68, in operator()
tvm.error.InternalError: Traceback (most recent call last):
9: mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
at /workspace/mlc-llm/cpp/serve/threaded_engine.cc:182
8: mlc::llm::serve::EngineImpl::Step()
at /workspace/mlc-llm/cpp/serve/engine.cc:650
7: mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:45
6: mlc::llm::serve::NewRequestPrefillActionObj::MatchPrefixCache(mlc::llm::serve::EngineState, mlc::llm::serve::BatchPrefillBaseActionObj::PrefillInput*)
at /workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc:301
5: mlc::llm::serve::ModelImpl::AddNewSequence(long)
at /workspace/mlc-llm/cpp/serve/model.cc:642
4: operator()
at /workspace/mlc-llm/cpp/serve/function_table.cc:68
3: tvm::runtime::BcastSessionObj::CallWithPacked(tvm::runtime::TVMArgs const&)
2: tvm::runtime::ProcessSessionObj::BroadcastPacked(tvm::runtime::TVMArgs const&)
1: tvm::support::Pipe::Write(void const*, unsigned long)
0: _ZN3tvm7runtime6deta
File "/workspace/tvm/src/runtime/disco/../../support/pipe.h", line 129
InternalError: Check failed: nwrite != -1 (-1 vs. -1) : Write Error: Broken pipe
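For context, the "OPTIONS /v1/chat/completions" line logged just before the crash is the CORS preflight of an ordinary OpenAI-style chat completion call. A minimal sketch of an equivalent request is below; the port (8000, the serve default since no --port was passed) and the model id in the payload are my assumptions, not values confirmed by the log.

import requests  # any OpenAI-compatible client would do

# Assumption: default port 8000 (no --port was passed to mlc_llm serve).
URL = "http://192.168.1.4:8000/v1/chat/completions"

payload = {
    # Assumption: the model id matches the path given to `mlc_llm serve`;
    # adjust it to whatever your server actually registers.
    "model": "llm/mistral-large-instruct-2407-q4f16_1",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])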
Expected behavior

The server keeps serving requests; the engine should not crash with a broken pipe.

Environment

- How you installed MLC-LLM (conda, source): Python pre-built package
- How you installed TVM-Unity (pip, source):
- Python version (e.g. 3.10): 3.11
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
On our side this "broken pipe" error also happens occasionally, though rather rarely. We are working on finding the cause; in the meantime, you can kill the processes and rerun the server.
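Until the root cause is found, the "kill the processes and rerun the server" workaround can be automated with a small supervisor that kills the whole serve process group (so the spawned multi-GPU workers go with it) and starts it again whenever the engine stops answering. This is only a sketch under my own assumptions (the exact serve command from this report, the default port 8000, and a one-token chat completion as the health probe); it is not part of MLC-LLM.

import os
import signal
import subprocess
import time

import requests

# Assumption: same command as in this report; default port 8000.
CMD = ["mlc_llm", "serve", "llm/mistral-large-instruct-2407-q4f16_1", "--host", "192.168.1.4"]
PROBE_URL = "http://192.168.1.4:8000/v1/chat/completions"

def engine_alive() -> bool:
    # A one-token completion exercises the background engine thread, which is
    # what dies with the broken pipe; a plain HTTP liveness check would not notice.
    payload = {
        "model": "llm/mistral-large-instruct-2407-q4f16_1",  # assumption: adjust to your model id
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
        "stream": False,
    }
    try:
        return requests.post(PROBE_URL, json=payload, timeout=120).status_code == 200
    except requests.RequestException:
        return False

while True:
    # start_new_session=True puts the server and its worker processes in one
    # process group, so a single killpg() cleans up everything.
    proc = subprocess.Popen(CMD, start_new_session=True)
    time.sleep(600)  # rough grace period for loading the sharded weights; tune as needed
    while proc.poll() is None and engine_alive():
        time.sleep(60)
    if proc.poll() is None:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    proc.wait()
    time.sleep(10)  # let the GPUs settle before rerunning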