fan-niu opened this issue 3 weeks ago
Thanks for reporting. We will find time and try to reproduce it. Meanwhile, may I ask how often this segmentation fault happens? Does it happen every time you use “70 concurrency”?
@MasterJH5574 Yes, every time we run with 70 concurrent requests, the service crashes. Thanks for your reply, looking forward to good news.
@MasterJH5574 Hi, is there any progress on this issue? Thanks.
@fan-niu Sorry, we haven't had enough bandwidth to work on it. We will try our best to get to it as early as possible.
🐛 Bug
A service started from Meta-Llama-3.1-70B-Instruct (fp8) crashes with a segmentation fault under high concurrency.
To Reproduce
1. Convert the model (refer to issue #2982 for the conversion steps).
2. Start the service.
3. Run a concurrency test: with 70 concurrent requests (avg input tokens = 2310, avg output tokens = 50), the service breaks; a load-test sketch follows below.
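For anyone trying to reproduce the load pattern, a minimal sketch of the concurrency test is below. It assumes the service exposes MLC's OpenAI-compatible /v1/chat/completions route; the URL, model id, and prompt length are placeholders approximating the averages reported above, not values taken from this issue.

```python
import asyncio

import aiohttp

URL = "http://127.0.0.1:8000/v1/chat/completions"  # assumed host/port, adjust to your setup
CONCURRENCY = 70

async def one_request(session: aiohttp.ClientSession, prompt: str) -> int:
    payload = {
        "model": "Meta-Llama-3.1-70B-Instruct",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 50,  # matches the ~50 avg output tokens reported
    }
    async with session.post(URL, json=payload) as resp:
        await resp.read()
        return resp.status

async def main() -> None:
    # A long prompt to approximate the ~2310 avg input tokens reported.
    prompt = "Please summarize: " + ("lorem ipsum " * 1000)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(one_request(session, prompt) for _ in range(CONCURRENCY)),
            return_exceptions=True,  # surface per-request failures without aborting the batch
        )
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```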
Error messages
Expected behavior
Is there any way to keep the service from crashing other than lowering the concurrency? For example, requests that cannot be processed could be evicted, or responses could simply be returned more slowly.
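Until there is a server-side fix, one possible workaround is client-side admission control: cap the number of in-flight requests with a semaphore so excess requests queue on the client instead of overloading the engine. A minimal sketch, assuming the same endpoint as above; MAX_IN_FLIGHT is a hypothetical ceiling to tune empirically below the crash threshold, not a value from this issue:

```python
import asyncio

import aiohttp

URL = "http://127.0.0.1:8000/v1/chat/completions"  # assumed endpoint, as above
MAX_IN_FLIGHT = 32  # hypothetical ceiling; tune to stay below the crash threshold

async def guarded_request(
    session: aiohttp.ClientSession,
    semaphore: asyncio.Semaphore,
    payload: dict,
) -> int:
    # Callers beyond MAX_IN_FLIGHT wait here instead of piling onto the engine.
    async with semaphore:
        async with session.post(URL, json=payload) as resp:
            await resp.read()
            return resp.status

async def main() -> None:
    payload = {
        "model": "Meta-Llama-3.1-70B-Instruct",  # placeholder model id
        "messages": [{"role": "user", "content": "hello"}],
        "max_tokens": 50,
    }
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        # 70 callers, but at most MAX_IN_FLIGHT reach the server at once.
        statuses = await asyncio.gather(
            *(guarded_request(session, semaphore, payload) for _ in range(70))
        )
    print(statuses)

if __name__ == "__main__":
    asyncio.run(main())
```

This only bounds the load from a single cooperating client; it does not protect the server from other traffic, which is why an engine-side eviction or backpressure mechanism would still be preferable.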
Environment
How you installed MLC-LLM (conda, source): conda
How you installed TVM-Unity (pip, source): pip
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
Additional context
The service works normally at low concurrency; at high concurrency it crashes. We hope the service can avoid crashing regardless of load.