
LangServe 🦜️🏓

LangServe Crash when multiple clients send requests. #263

Open tuobulatuo opened 11 months ago

tuobulatuo commented 11 months ago

When I send two or more concurrent requests to the server, it crashes with the error logs below:

CUDA version: 11.7, NVIDIA driver version: 515.65.01

```
** On entry to SGEMM parameter number 13 had an illegal value

cuBLAS error 7 at /tmp/pip-install-_wvffp3m/llama-cpp-python_93b4c08269a545e2a4e8f946ea11d827/vendor/llama.cpp/ggml-cuda.cu:6140 current device: 0

CUDA error 4 at /tmp/pip-install-_wvffp3m/llama-cpp-python_93b4c08269a545e2a4e8f946ea11d827/vendor/llama.cpp/ggml-cuda.cu:455: driver shutting down current device: 0
./bins/langchain_serve_test.sh: line 7: 311435 Segmentation fault (core dumped) python -u langchain_serve.py
```

eyurtsev commented 11 months ago

Hello @tuobulatuo, this does not look like a LangServe issue -- the underlying model probably cannot handle concurrency. Have you checked llama.cpp? I'll try to take a look later this week, but I think you'll need some sort of queue to serialize the concurrent requests.

this looks relevant https://github.com/ggerganov/llama.cpp/discussions/1871