ita9naiwa closed this issue 4 months ago
It works well when there are fewer than 4 concurrent requests.
Hi @ita9naiwa, thanks for reporting! Could you try updating your local TVM to the latest commit? Commit https://github.com/apache/tvm/commit/18a2a250f8c7f16f5f5be6753861ba5db8fb89fa can address this issue if your TVM is behind that commit.
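If you installed via the nightly wheels, reinstalling them should pull in a build that already contains that commit. A minimal sketch, assuming the same CUDA 12.1 nightly packages mentioned in the environment section of this issue:

# Upgrade the prebuilt nightly wheels to the latest build (packages assumed from this issue).
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-ai-nightly-cu121 mlc-llm-nightly-cu121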
Sure! I'll try
@MasterJH5574 Hi, I tested with the latest tvm and it works well.
Thanks!
🐛 Bug
When mlc_llm serve is running in server mode and receives more than 4 concurrent queries, it shows the following error.
To Reproduce
Steps to reproduce the behavior:
CUDA_VISIBLE_DEVICES=1 mlc_llm serve --mode server \
    --model-lib llama-7b/llama-7b-cuda.so \
    llama-7b
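Then send more than 4 requests at the same time. A minimal sketch for driving that load, assuming the default 127.0.0.1:8000 address of mlc_llm serve, its OpenAI-compatible /v1/chat/completions endpoint, and "llama-7b" as the model id (all assumptions; adjust to the actual setup):

# Fire 8 chat-completion requests concurrently; host, port, and model id are assumed defaults.
for i in $(seq 1 8); do
  curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama-7b", "messages": [{"role": "user", "content": "Hello"}]}' &
done
wait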
Expected behavior
Environment
- How you installed MLC-LLM (conda, source): via pip (python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121)
- How you installed TVM-Unity (pip, source): via pip
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):

Additional context
I compiled Llama-2-7B via TVM using the following scripts,
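(The exact scripts were not captured in this report; the commands below are an assumed equivalent using the standard mlc_llm convert_weight / gen_config / compile flow. The quantization, conversation template, and paths are illustrative guesses, not the original settings.)

# Convert the HF weights, generate the chat config, and compile the CUDA model library.
# All paths, the q4f16_1 quantization, and the llama-2 template are assumptions.
mlc_llm convert_weight ./Llama-2-7b-hf --quantization q4f16_1 -o ./llama-7b
mlc_llm gen_config ./Llama-2-7b-hf --quantization q4f16_1 \
    --conv-template llama-2 -o ./llama-7b
mlc_llm compile ./llama-7b/mlc-chat-config.json --device cuda \
    -o ./llama-7b/llama-7b-cuda.so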
and then ran MLC LLM via:

CUDA_VISIBLE_DEVICES=1 mlc_llm serve --mode server \
    --model-lib llama-7b/llama-7b-cuda.so \
    llama-7b