dkiran1 opened this issue 6 months ago
@yutianchen666 Could you help reproduce the issue? I am not sure if it is the OpenAI version causing the API break.
I used openai==0.28, since the latest version gave an error and recommended using this version.
ok, I'll reproduce it soon
@dkiran1 Thank you for your report. If you want to use the OpenAI-compatible SDK, please remove the --simple parameter. After serving, set ENDPOINT_URL=http://localhost:8000/v1 when running query_http_requests.py, or set OPENAI_API_BASE=http://localhost:8000/v1 when running query_openai_sdk.py. See serve.md for more details.
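For reference, a minimal sketch of how the openai==0.28 SDK talks to that endpoint (the model name and prompt here are placeholders, not taken from this issue):

import openai

openai.api_base = "http://localhost:8000/v1"  # same value as OPENAI_API_BASE
openai.api_key = "not-needed"  # the local server does not validate the key

# list the served models, then send a chat completion request
models = openai.Model.list()
print([m["id"] for m in models["data"]])
resp = openai.ChatCompletion.create(
    model="neural-chat-7b-v3-1",  # placeholder: use the model you served
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp["choices"][0]["message"]["content"])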
Hi Yan, thanks for the details. I tried the steps mentioned above and could run the inference server with the falcon model, but when running python examples/inference/api_server_openai/query_openai_sdk.py --model_name="falcon-7b" it waits for a response for a long time with none coming. I also tried the neural-chat model; it was working yesterday, but after upgrading the transformers library it gives this error:
(ServeController pid=11891) ERROR 2024-01-19 05:35:26,615 controller 11891 deployment_state.py:672 - Exception in replica 'neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36', the replica will be stopped.
(ServeController pid=11891) Traceback (most recent call last):
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/deployment_state.py", line 670, in checkready
(ServeController pid=11891) , self._version = ray.get(self._ready_obj_ref)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=11891) return fn(*args, **kwargs)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=11891) return func(*args, **kwargs)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2656, in get
(ServeController pid=11891) values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 869, in get_objects
(ServeController pid=11891) raise value.as_instanceof_cause()
(ServeController pid=11891) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:neural-chat-7b-v3-1:PredictorDeployment.initialize_and_get_metadata() (pid=18013, ip=172.17.0.2, actor_id=685216a503325bcc4e3c3c7701000000, repr=<ray.serve._private.replica.ServeReplica:neural-chat-7b-v3-1:PredictorDeployment object at 0x7fabd93efd00>)
(ServeController pid=11891) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(ServeController pid=11891) return self.__get_result()
(ServeController pid=11891) File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=11891) raise self._exception
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 570, in initialize_and_get_metadata
(ServeController pid=11891) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=11891) RuntimeError: Traceback (most recent call last):
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 554, in initialize_and_get_metadata
(ServeController pid=11891) await self._user_callable_wrapper.initialize_callable()
(ServeController pid=11891) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=11891) await self._call_func_or_gen(
(ServeController pid=11891) result = callable(*args, **kwargs)
(ServeController pid=11891) File "/root/llm-ray/inference/predictor_deployment.py", line 64, in __init__
(ServeController pid=11891) self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=11891) File "/root/llm-ray/inference/transformer_predictor.py", line 22, in init
(ServeController pid=11891) from optimum.habana.transformers.modeling_utils import (
(ServeController pid=11891) File "/root/optimum-habana/optimum/habana/transformers/modeling_utils.py", line 19, in
Hi @dkiran1, we currently have limited bandwidth and hardware to test on Gaudi, and the Gaudi-related part is not up to date. I just tested in Docker; in the vault.habana.ai/gaudi-docker/1.13.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.0 container, you only need to:
# install llm-on-ray, assume mounted
pip install -e .
# install latest optimum[habana]
pip install optimum[habana]
Make sure the transformers version is 4.34.1, which is required by optimum[habana]; a mismatched transformers version is what caused your error. In addition, inference with Gaudi does not require IPEX.
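A quick way to confirm the environment (just a sanity-check sketch; 4.34.1 is the version named above):

import transformers

# optimum[habana] pins transformers; this should print 4.34.1
print(transformers.__version__)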
Hi Lin, thanks a lot. After doing pip install optimum[habana], the neural-chat model is working fine with query_openai_sdk.py. I will test the other models and post the status.
I tested the falcon-7b, mpt-7b, mistral-7b, and neural-chat models. I could run the inference server for all of them, and I get responses for neural-chat and mistral-7b with query_openai_sdk.py, but it keeps waiting for a response with the mpt-7b and falcon models.
Hi @dkiran1, when you use OpenAI serving, try adding the --max_new_tokens config. It seems optimum-habana requires this config. I'll look into why and how to fix it later.
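In the meantime, a sketch of capping the generation length from the client side, assuming the OpenAI-compatible route forwards the SDK's max_tokens to the backend as max_new_tokens (that mapping is an assumption, not confirmed here):

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "not-needed"

resp = openai.ChatCompletion.create(
    model="falcon-7b",  # placeholder: use the model you served
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=128,  # assumption: mapped to max_new_tokens on the server side
)
print(resp["choices"][0]["message"]["content"])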
I ran inference of the falcon-7b and neural-chat-7b-v3-1 models on the Ray server with the commands below:
python inference/serve.py --config_file inference/models/neural-chat-7b-v3-1.yaml --simple
python inference/serve.py --config_file inference/models/falcon-7b.yaml --simple
I could run a test inference with:
python examples/inference/api_server_simple/query_single.py --model_endpoint http://172.17.0.2:8000/neural-chat-7b-v3-1
I then exported:
export OPENAI_API_BASE=http://172.17.0.2:8000/falcon-7b
export OPENAI_API_KEY=
and tried to run python examples/inference/api_server_openai/query_openai_sdk.py, but I am getting the error below:
File "/root/llm-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 45, in
models = openai.Model.list()
File "/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/listable_apiresource.py", line 60, in list
response, , api_key = requestor.request(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 757, in _interpret_response_line
raise error.APIError(
openai.error.APIError: HTTP code 500 from API (Unexpected error, traceback: ray::ServeReplica:falcon-7b:PredictorDeployment.handle_request_streaming() (pid=15684, ip=172.17.0.2)
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error
raise exception
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 994, in call_user_method
await self._call_func_or_gen(
File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 750, in _call_func_or_gen
result = await result
File "/root/llm-ray/inference/predictor_deployment.py", line 84, in call
json_request: Dict[str, Any] = await http_request.json()
File "/usr/local/lib/python3.10/dist-packages/starlette/requests.py", line 244, in json
self._json = json.loads(body)
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).)
I installed openai version 0.28.0. Please let me know what the issue could be; am I missing any installations?
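For context on the trace above: the --simple endpoint parses the raw request body as JSON (predictor_deployment.py calls http_request.json()), so it only handles direct POSTs like the one query_single.py sends, while openai.Model.list() sends a request with no JSON body at all, which is what raises the JSONDecodeError. A minimal sketch of a direct request against the simple endpoint (the payload fields here are assumptions; query_single.py shows the real schema):

import requests

# assumption: the simple endpoint expects a plain JSON payload;
# check examples/inference/api_server_simple/query_single.py for the actual fields
resp = requests.post(
    "http://172.17.0.2:8000/falcon-7b",
    json={"text": "Hello", "config": {}},
)
print(resp.text)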