intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

Getting error while executing query_openai_sdk.py to test the inference #66

Open dkiran1 opened 6 months ago

dkiran1 commented 6 months ago

I ran inference of the falcon-7b and neural-chat-7b-v3-1 models on the Ray server with the commands below:

`python inference/serve.py --config_file inference/models/neural-chat-7b-v3-1.yaml --simple`
`python inference/serve.py --config_file inference/models/falcon-7b.yaml --simple`

I could run the test inference with `python examples/inference/api_server_simple/query_single.py --model_endpoint http://172.17.0.2:8000/neural-chat-7b-v3-1`.

I then exported `OPENAI_API_BASE=http://172.17.0.2:8000/falcon-7b` and `OPENAI_API_KEY=` and tried to run `python examples/inference/api_server_openai/query_openai_sdk.py`, but I am getting the error below:

File "/root/llm-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 45, in models = openai.Model.list() File "/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/listable_apiresource.py", line 60, in list response, , api_key = requestor.request( File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 298, in request resp, got_stream = self._interpret_response(result, stream) File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 700, in _interpret_response self._interpret_response_line( File "/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py", line 757, in _interpret_response_line raise error.APIError( openai.error.APIError: HTTP code 500 from API (Unexpected error, traceback: ray::ServeReplica:falcon-7b:PredictorDeployment.handle_request_streaming() (pid=15684, ip=172.17.0.2) File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/utils.py", line 165, in wrap_to_ray_error raise exception File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 994, in call_user_method await self._call_func_or_gen( File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 750, in _call_func_or_gen result = await result File "/root/llm-ray/inference/predictor_deployment.py", line 84, in call json_request: Dict[str, Any] = await http_request.json() File "/usr/local/lib/python3.10/dist-packages/starlette/requests.py", line 244, in json self._json = json.loads(body) File "/usr/lib/python3.10/json/init.py", line 346, in loads return _default_decoder.decode(s) File "/usr/lib/python3.10/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).)

I installed the openai 0.28.0 package. Please let me know what the issue could be; am I missing any installations?

xwu99 commented 6 months ago

@yutianchen666 Could you help reproduce the issue? I am not sure if it is the OpenAI version causing an API break.

dkiran1 commented 6 months ago

I used openai==0.28, since the latest version gave an error and recommended using this version.
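For reference, pinning the SDK to that release looks like this (a minimal sketch; the exact pin is whatever the newer SDK's error message recommends):

```bash
# Pin the OpenAI SDK to the pre-1.0 API that query_openai_sdk.py expects (per the comment above)
pip install "openai==0.28"
```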

yutianchen666 commented 6 months ago

> @yutianchen666 Could you help reproduce the issue? I am not sure if it is the OpenAI version causing an API break.

ok, I'll reproduce it soon

KepingYan commented 6 months ago

@dkiran1 Thank you for the report. If you want to use the OpenAI-compatible SDK, please remove the --simple parameter. After serving, set `ENDPOINT_URL=http://localhost:8000/v1` when running query_http_requests.py, or set `OPENAI_API_BASE=http://localhost:8000/v1` when running query_openai_sdk.py. See serve.md for more details; a command sketch follows below.
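Putting those steps together, a minimal sketch of the OpenAI-compatible flow might look like the following (the model name, the API-key placeholder, and the query_http_requests.py path are assumptions; adjust to your setup):

```bash
# Serve without --simple so the OpenAI-compatible routes are exposed
python inference/serve.py --config_file inference/models/neural-chat-7b-v3-1.yaml

# Point the OpenAI SDK at the Ray Serve endpoint
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed   # assumed placeholder value

# Query through the OpenAI-compatible SDK example
python examples/inference/api_server_openai/query_openai_sdk.py

# Or use the raw HTTP example script (path assumed)
export ENDPOINT_URL=http://localhost:8000/v1
python examples/inference/api_server_openai/query_http_requests.py
```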

dkiran1 commented 6 months ago

Hi Yan, thanks for the details. I tried the steps mentioned above and could run the inference server with the falcon model, but when running `python examples/inference/api_server_openai/query_openai_sdk.py --model_name="falcon-7b"` it has been waiting for a response for a long time with no result. I also tried the neural-chat model; it was working yesterday, but after upgrading the transformers library it is giving the error below:

```
d lead to undefined behavior!
(ServeController pid=11891) ERROR 2024-01-19 05:35:26,615 controller 11891 deployment_state.py:672 - Exception in replica 'neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36', the replica will be stopped.
(ServeController pid=11891) Traceback (most recent call last):
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/deployment_state.py", line 670, in check_ready
(ServeController pid=11891)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=11891)     return fn(*args, **kwargs)
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=11891)     return func(*args, **kwargs)
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2656, in get
(ServeController pid=11891)     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 869, in get_objects
(ServeController pid=11891)     raise value.as_instanceof_cause()
(ServeController pid=11891) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:neural-chat-7b-v3-1:PredictorDeployment.initialize_and_get_metadata() (pid=18013, ip=172.17.0.2, actor_id=685216a503325bcc4e3c3c7701000000, repr=<ray.serve._private.replica.ServeReplica:neural-chat-7b-v3-1:PredictorDeployment object at 0x7fabd93efd00>)
(ServeController pid=11891)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
(ServeController pid=11891)     return self.__get_result()
(ServeController pid=11891)   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=11891)     raise self._exception
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 570, in initialize_and_get_metadata
(ServeController pid=11891)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=11891) RuntimeError: Traceback (most recent call last):
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 554, in initialize_and_get_metadata
(ServeController pid=11891)     await self._user_callable_wrapper.initialize_callable()
(ServeController pid=11891)   File "/usr/local/lib/python3.10/dist-packages/ray/serve/_private/replica.py", line 778, in initialize_callable
(ServeController pid=11891)     await self._call_func_or_gen(
(ServeController pid=11891)     result = callable(*args, **kwargs)
(ServeController pid=11891)   File "/root/llm-ray/inference/predictor_deployment.py", line 64, in __init__
(ServeController pid=11891)     self.predictor = TransformerPredictor(infer_conf)
(ServeController pid=11891)   File "/root/llm-ray/inference/transformer_predictor.py", line 22, in __init__
(ServeController pid=11891)     from optimum.habana.transformers.modeling_utils import (
(ServeController pid=11891)   File "/root/optimum-habana/optimum/habana/transformers/modeling_utils.py", line 19, in <module>
(ServeController pid=11891)     from .models import (
(ServeController pid=11891)   File "/root/optimum-habana/optimum/habana/transformers/models/__init__.py", line 59, in <module>
(ServeController pid=11891)     from .mpt import (
(ServeController pid=11891)   File "/root/optimum-habana/optimum/habana/transformers/models/mpt/__init__.py", line 1, in <module>
(ServeController pid=11891)     from .modeling_mpt import (
(ServeController pid=11891)   File "/root/optimum-habana/optimum/habana/transformers/models/mpt/modeling_mpt.py", line 24, in <module>
(ServeController pid=11891)     from transformers.models.mpt.modeling_mpt import MptForCausalLM, MptModel, _expand_mask, _make_causal_mask
(ServeController pid=11891) ImportError: cannot import name '_expand_mask' from 'transformers.models.mpt.modeling_mpt' (/usr/local/lib/python3.10/dist-packages/transformers/models/mpt/modeling_mpt.py)
(ServeController pid=11891) INFO 2024-01-19 05:35:27,338 controller 11891 deployment_state.py:2188 - Replica neural-chat-7b-v3-1#PredictorDeployment#3jmxrf36 is stopped.
(ServeController pid=11891) INFO 2024-01-19 05:35:27,339 controller 11891 deployment_state.py:1850 - Adding 1 replica to deployment PredictorDeployment in application 'neural-chat-7b-v3-1'.
exit
(ServeReplica:router:PredictorDeployment pid=18206) /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
(ServeReplica:router:PredictorDeployment pid=18206)   warnings.warn(
(ServeReplica:neural-chat-7b-v3-1:PredictorDeployment pid=18013) [WARNING|utils.py:190] 2024-01-19 05:35:26,443 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but the driver version is v1.13.0, this could lead to undefined behavior!
```

kira-lin commented 6 months ago

Hi @dkiran1, we currently have limited bandwidth and hardware to test on Gaudi, so the Gaudi-related part is not up to date. I just tested in Docker, in the vault.habana.ai/gaudi-docker/1.13.0/ubuntu22.04/habanalabs/pytorch-installer-2.1.0 container; you only need to:

```bash
# install llm-on-ray, assume mounted
pip install -e .
# install latest optimum[habana]
pip install optimum[habana]
```

Make sure the transformers version is 4.34.1, which is required by optimum[habana]; a mismatched transformers version is what caused your error. In addition, inference on Gaudi does not require IPEX.
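A quick way to confirm the installed version matches what optimum[habana] expects (a minimal check, assuming the same Python environment that serves the models):

```bash
# Print the transformers version; per the comment above it should be 4.34.1
python -c "import transformers; print(transformers.__version__)"

# If it is not, pin it explicitly (assumes 4.34.1 is still the version optimum[habana] requires)
pip install "transformers==4.34.1"
```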

dkiran1 commented 6 months ago

Hi Lin, thanks a lot. After doing `pip install optimum[habana]`, the neural-chat model along with query_openai_sdk.py is working fine. I will test the other models and post the status.

dkiran1 commented 6 months ago

I tested the falcon-7b, mpt-7b, mistral-7b and neural-chat models and could run the inference server for all of them. I am getting responses for the neural-chat and mistral-7b models with query_openai_sdk.py, but it keeps waiting for a response for the mpt-7b and falcon models.

kira-lin commented 6 months ago

Hi @dkiran1, when you use OpenAI serving, try adding the --max_new_tokens config, as sketched below. It seems optimum-habana requires this config. I'll look into why and how to fix this later.
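A hypothetical invocation (the option name comes from the comment above; whether it is passed on the serve.py command line or set in the model's yaml may differ, and the value 256 is just an example):

```bash
# Serve with an explicit generation length so optimum-habana does not wait indefinitely for a limit
python inference/serve.py --config_file inference/models/falcon-7b.yaml --max_new_tokens 256
```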