Open · tuobulatuo opened this issue 11 months ago
When I send two or more requests to the server at the same time, it crashes. Error logs below:

CUDA version: 11.7, NVIDIA driver version: 515.65.01

```
** On entry to SGEMM parameter number 13 had an illegal value
cuBLAS error 7 at /tmp/pip-install-_wvffp3m/llama-cpp-python_93b4c08269a545e2a4e8f946ea11d827/vendor/llama.cpp/ggml-cuda.cu:6140
current device: 0
CUDA error 4 at /tmp/pip-install-_wvffp3m/llama-cpp-python_93b4c08269a545e2a4e8f946ea11d827/vendor/llama.cpp/ggml-cuda.cu:455: driver shutting down
current device: 0
./bins/langchain_serve_test.sh: line 7: 311435 Segmentation fault (core dumped) python -u langchain_serve.py
```
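A minimal client sketch of the kind of traffic that triggers this, assuming a langserve app mounted at a hypothetical `/chain` path (the URL, path, and prompts are placeholders for whatever `langchain_serve.py` actually serves). A single request succeeds; two in flight crash the server:

```python
# Hypothetical reproduction: fire two concurrent /invoke calls at a
# langserve endpoint. The /chain path and prompts are assumptions.
from concurrent.futures import ThreadPoolExecutor

import requests


def invoke(prompt: str) -> str:
    # langserve runnables expose POST /invoke taking {"input": ...}
    # and returning {"output": ...}.
    resp = requests.post(
        "http://localhost:8000/chain/invoke",
        json={"input": prompt},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["output"]


# Two simultaneous requests are enough to hit concurrent llama.cpp access.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(invoke, ["Hello", "World"]))
print(results)
```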
Hello @tuobulatuo, this does not look like a langserve issue -- the underlying model probably cannot handle concurrency. Have you checked llama.cpp? I'll try to take a look later this week, but I think you'll need some sort of queue to handle the concurrent requests.

This looks relevant: https://github.com/ggerganov/llama.cpp/discussions/1871
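For anyone hitting this, here is a minimal sketch of that workaround, assuming the served chain wraps a llama-cpp-python model (the model path and the `guarded_chain` name are illustrative, not from the original report). An `asyncio.Lock` acts as a queue of one: concurrent requests wait on the lock instead of entering llama.cpp at the same time.

```python
# Sketch: serialize access to a llama.cpp model behind langserve.
# Model path and names are hypothetical; adapt to your own app.
import asyncio

from langchain_community.llms import LlamaCpp
from langchain_core.runnables import RunnableLambda

llm = LlamaCpp(model_path="models/llama-2-7b.Q4_K_M.gguf")  # hypothetical path
_lock = asyncio.Lock()


async def _invoke_serially(prompt: str) -> str:
    # Only one coroutine at a time may touch the model; concurrent /invoke
    # calls queue up here rather than reaching llama.cpp simultaneously.
    async with _lock:
        return await llm.ainvoke(prompt)


guarded_chain = RunnableLambda(_invoke_serially)
# add_routes(app, guarded_chain, path="/chain")  # serve with langserve as before
```

This only serializes requests within a single worker process; if the app runs multiple workers (each loading its own model), a proper cross-process queue or a single-worker deployment would be needed instead.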