Closed · gmaimmortal closed this 2 months ago
Hello @gmaimmortal! I'm here to assist you with any bugs, questions, or contributions. I'm looking forward to helping you out!
To resolve the "Server Unavailable" error caused by a CUDA kernel error during an API call to the system reasoning model configured for local deployment using Xinference, follow these steps:
Verify Model Compatibility: Ensure that the model you are deploying is compatible with your GPU. Refer to the Xinference built-in models documentation to check the compatibility of the model with your hardware.
Check CUDA Installation: Ensure that CUDA is properly installed and configured on your system. You can verify this by running a simple CUDA program or using the nvidia-smi command to check the status of your GPU.
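If PyTorch is installed in the same environment that Xinference runs in, a quick sanity check like the following minimal sketch (assuming the torch package is available there) can confirm that the GPU is visible and that a trivial kernel executes:

```python
# Minimal CUDA sanity check (assumes PyTorch is installed in the Xinference environment).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Run a trivial kernel to make sure basic GPU execution works.
    x = torch.ones(1024, device="cuda")
    print("Sum on GPU:", x.sum().item())
```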
Update Drivers and Libraries: Make sure that your GPU drivers and CUDA libraries are up to date. Outdated drivers or libraries can cause compatibility issues.
Review Logs: Check the logs for more detailed error messages. The logs can provide specific information about what went wrong during the kernel execution.
Restart Xinference: Sometimes, simply restarting the Xinference service can resolve transient issues. Use the following commands to restart:
$ xinference-local stop
$ xinference-local start
Adjust Model Configuration: If the issue persists, you may need to adjust the model configuration or try deploying a different model that is known to be compatible with your hardware.
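One way to adjust the configuration is to relaunch the model with explicit settings through the Xinference Python client. This is only a sketch: the model name, size, and quantization below are placeholders you should replace with values your GPU actually supports, and newer Xinference releases may also require a model_engine argument.

```python
# Relaunch the model with explicit settings via the Xinference client (sketch).
# model_name, model_size_in_billions, and quantization are placeholders;
# newer Xinference versions may also require model_engine="transformers".
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model_uid = client.launch_model(
    model_name="vicuna-v1.3",
    model_format="pytorch",
    model_size_in_billions=7,
    quantization="none",
)
print("Launched model UID:", model_uid)
```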
Additionally, ensure that you have correctly set up and deployed Xinference locally. Here are the steps to deploy Xinference locally:
Install Xinference: Use PyPI to install Xinference:
$ pip install "xinference[all]"
Start Local Deployment: Start Xinference in local deployment mode:
$ xinference-local
2023-08-20 19:21:05,265 xinference 10148 INFO Xinference successfully started. Endpoint: http://127.0.0.1:9997
2023-08-20 19:21:05,266 xinference.core.supervisor 10148 INFO Worker 127.0.0.1:37822 has been added successfully
2023-08-20 19:21:05,267 xinference.deploy.worker 10148 INFO Xinference worker successfully started.
By default, Xinference starts a local worker with the endpoint http://127.0.0.1:9997. If you need external access, start it with -H 0.0.0.0.
Create and Deploy Model: Access http://127.0.0.1:9997 to select and deploy the model. Ensure the model is compatible with your hardware.
Obtain Model UID: Retrieve the model UID from the deployment page, e.g., 2c886330-8849-11ee-9518-43b0b8f40bea.
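If you prefer to fetch the UID programmatically rather than copying it from the page, the Xinference client can also list running models. A small sketch, assuming the default local endpoint and that list_models() returns a UID-to-spec mapping as in recent releases:

```python
# List running models and their UIDs (assumes the default local endpoint).
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
models = client.list_models()  # expected: {model_uid: model_spec_dict, ...}
for uid, spec in models.items():
    print(uid, "->", spec.get("model_name"))
```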
Connect Model in Dify: In Dify, go to Settings > Model Provider > Xinference and input the following:
Model Name: vicuna-v1.3
Server URL: http://<Machine_IP>:9997 (replace <Machine_IP> with your machine's IP address)
Model UID: 2c886330-8849-11ee-9518-43b0b8f40bea
Click "Save" to use the model in your application.
By following these steps and ensuring proper configuration, you should be able to resolve the "Server Unavailable" error caused by a CUDA kernel error during an API call to the system reasoning model configured for local deployment using Xinference [1][2][3].
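If the web conversation works but API calls fail, one way to narrow down whether the CUDA assert comes from Xinference itself or from how Dify calls it is to hit Xinference's OpenAI-compatible chat endpoint directly, bypassing Dify. A sketch, with the host and model UID as placeholders to replace with your own values:

```python
# Call Xinference's OpenAI-compatible chat endpoint directly, bypassing Dify.
# The host and model UID below are placeholders from the example above.
import requests

resp = requests.post(
    "http://<Machine_IP>:9997/v1/chat/completions",  # replace <Machine_IP>
    json={
        "model": "2c886330-8849-11ee-9518-43b0b8f40bea",  # your model UID
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.status_code)
print(resp.json())
```

If this direct call also triggers the CUDA device-side assert, the problem is on the Xinference/GPU side rather than in Dify.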
To continue talking to Dosu, mention @dosu.
Please try upgrading to the latest version; we have already fixed this.
Dify has been updated to the main version, but the error still persists. Only the API calls are reporting this error, while the web conversation page is functioning normally.
I have the same question. Xinference is working on the web page. However, when I call the API it reports the error "An Unexpected Error Occurred!"
Self Checks
Dify version
0.6.9
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
The system reasoning model is configured for local deployment with Xinference.
An API call then fails. Details:
error logs:
dify-api error log:
2024-08-13 17:24:19 Exception in thread Thread-173 (full_text_index_search):
2024-08-13 17:24:19 Traceback (most recent call last):
2024-08-13 17:24:19   File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-08-13 17:24:19     self.run()
2024-08-13 17:24:19   File "/usr/local/lib/python3.10/threading.py", line 953, in run
2024-08-13 17:24:19     self._target(*self._args, **self._kwargs)
2024-08-13 17:24:19   File "/app/api/core/rag/datasource/retrieval_service.py", line 156, in full_text_index_search
2024-08-13 17:24:19     documents = vector_processor.search_by_full_text(
2024-08-13 17:24:19   File "/app/api/core/rag/datasource/vdb/vector_factory.py", line 232, in search_by_full_text
2024-08-13 17:24:19     return self._vector_processor.search_by_full_text(query, **kwargs)
2024-08-13 17:24:19   File "/app/api/core/rag/datasource/vdb/weaviate/weaviate_vector.py", line 234, in search_by_full_text
2024-08-13 17:24:19     result = query_obj.with_bm25(query=query, properties=properties).with_limit(kwargs.get('top_k', 2)).do()
2024-08-13 17:24:19   File "/usr/local/lib/python3.10/site-packages/weaviate/gql/get.py", line 1304, in do
2024-08-13 17:24:19     return super().do()
2024-08-13 17:24:19   File "/usr/local/lib/python3.10/site-packages/weaviate/gql/filter.py", line 81, in do
2024-08-13 17:24:19     raise UnexpectedStatusCodeException("Query was not successful", response)
2024-08-13 17:24:19 weaviate.exceptions.UnexpectedStatusCodeException: Query was not successful! Unexpected status code: 422, with response body: {'error': [{'message': 'no graphql provider present, this is most likely because no schema is present. Import a schema first!'}]}.
xinference error log:
2024-08-13 17:24:26 2024-08-13 09:24:26,586 xinference.api.restful_api 1 ERROR [address=0.0.0.0:35963, pid=367] CUDA error: device-side assert triggered
2024-08-13 17:24:26 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-08-13 17:24:26 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-08-13 17:24:26 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
2024-08-13 17:24:26 Traceback (most recent call last):
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1477, in create_chat_completion
2024-08-13 17:24:26     data = await model.chat(prompt, system_prompt, chat_history, **kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
2024-08-13 17:24:26     return self._process_result_message(result)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
2024-08-13 17:24:26     raise message.as_instanceof_cause()
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
2024-08-13 17:24:26     result = await self._run_coro(message.message_id, coro)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
2024-08-13 17:24:26     return await coro
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
2024-08-13 17:24:26     return await super().on_receive(message)  # type: ignore
2024-08-13 17:24:26   File "xoscar/core.pyx", line 558, in __on_receive__
2024-08-13 17:24:26     raise ex
2024-08-13 17:24:26   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
2024-08-13 17:24:26     async with self._lock:
2024-08-13 17:24:26   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
2024-08-13 17:24:26     with debug_async_timeout('actor_lock_timeout',
2024-08-13 17:24:26   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
2024-08-13 17:24:26     result = await result
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
2024-08-13 17:24:26     ret = await func(*args, **kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 87, in wrapped_func
2024-08-13 17:24:26     ret = await fn(self, *args, **kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 462, in _wrapper
2024-08-13 17:24:26     r = await func(self, *args, **kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 473, in chat
2024-08-13 17:24:26     response = await self._call_wrapper(
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 111, in _async_wrapper
2024-08-13 17:24:26     return await fn(*args, **kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 369, in _call_wrapper
2024-08-13 17:24:26     ret = await asyncio.to_thread(fn, *args, **kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/asyncio/threads.py", line 25, in to_thread
2024-08-13 17:24:26     return await loop.run_in_executor(None, func_call)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-08-13 17:24:26     result = self.fn(*self.args, **self.kwargs)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/pytorch/core.py", line 533, in chat
2024-08-13 17:24:26     c = self.generate(full_prompt, generate_config)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/pytorch/core.py", line 348, in generate
2024-08-13 17:24:26     for completion_chunk, completion_usage in generate_stream(
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
2024-08-13 17:24:26     response = gen.send(None)
2024-08-13 17:24:26   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/pytorch/utils.py", line 176, in generate_stream
2024-08-13 17:24:26     out = model(torch.as_tensor([input_ids], device=device), use_cache=True)
2024-08-13 17:24:26 RuntimeError: [address=0.0.0.0:35963, pid=367] CUDA error: device-side assert triggered
2024-08-13 17:24:26 CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-08-13 17:24:26 For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-08-13 17:24:26 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
✔️ Expected Behavior
The API can be called normally
❌ Actual Behavior
No response