Hi, I hit the issue below when trying to serve GPT2 as described in the guide. Could anyone help me check whether this is a configuration-related error?
$ python examples/inference/api_server_openai/query_http_requests.py
chunk content: {"generated_text":null,"tool_calls":null,"num_input_tokens":null,"num_input_tokens_batch":null,"num_generated_tokens":null,"num_generated_tokens_batch":null,"preprocessing_time":null,"generation_time":null,"timestamp":1715708425.9959323,"finish_reason":null,"error":{"object":"error","message":"Internal Server Error","internal_message":"Internal Server Error","type":"InternalServerError","param":{},"code":500}}
Traceback (most recent call last):
File "/home/rcp_user/yongqiang/llm-on-ray/examples/inference/api_server_openai/query_http_requests.py", line 90, in <module>
raise e
File "/home/rcp_user/yongqiang/llm-on-ray/examples/inference/api_server_openai/query_http_requests.py", line 85, in <module>
choices = json.loads(chunk)["choices"]
~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'
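The KeyError itself looks like a client-side symptom: the chunk printed above is an error payload with no "choices" key, so the real failure is on the server. Below is a minimal sketch of how the parsing step could check for that first, assuming the payload shape shown above. `extract_choices` is a hypothetical helper for illustration, not something in query_http_requests.py.

```python
import json


def extract_choices(chunk: str):
    """Return the "choices" list from a response chunk.

    Hypothetical helper: raises RuntimeError carrying the server's error
    message when the chunk is an error payload instead of a completion.
    """
    payload = json.loads(chunk)
    error = payload.get("error")
    if error is not None:
        # Surface the server-side error instead of failing later with KeyError.
        raise RuntimeError(
            f"server returned error {error.get('code')}: {error.get('message')}"
        )
    return payload["choices"]


# Trimmed copy of the error chunk from the output above.
error_chunk = (
    '{"generated_text": null, "finish_reason": null, '
    '"error": {"object": "error", "message": "Internal Server Error", '
    '"type": "InternalServerError", "param": {}, "code": 500}}'
)

try:
    extract_choices(error_chunk)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

With a check like this the client would report the HTTP 500 directly, which points at the server-side log rather than at the JSON parsing.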
I then checked the logs with "ray logs cluster":
$ ray logs cluster worker-f07bb5a711a3c88ae7720f125167d0d2a7e64799231f33910f71ac72-01000000-1624109.err
--- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. Set to -1 for getting the entire file. ---
:job_id:01000000
:actor_name:ServeReplica:router:PredictorDeployment
/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
2024-05-14 13:39:51,629 - _logger.py - IPEX - WARNING - [NotSupported]fail to apply ipex.llm.optimize due to: Could not run 'ipex_prepack::linear_prepack' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'ipex_prepack::linear_prepack' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
CPU: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/jit/cpu/kernels/RegisterOpContextClass.cpp:192 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:154 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:324 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:86 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:53 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:57 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:65 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:69 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:77 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:61 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:90 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:73 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:81 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:297 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:378 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:244 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:202 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:162 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:166 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:158 [backend fallback]
, fallback to the origin model
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
ERROR 2024-05-14 13:40:25,987 router_PredictorDeployment dh04lq3r 442f8811-5cef-45f9-916b-e4e964ef92dc /v1/chat/completions replica.py:359 - Request failed:
ray::ServeReplica:router:PredictorDeployment.handle_request_with_rejection() (pid=1624109, ip=10.97.102.172)
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/utils.py", line 168, in wrap_to_ray_error
raise exception
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1131, in call_user_method
result = await self._handle_user_method_result(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1038, in _handle_user_method_result
async for r in result:
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 444, in openai_call
yield await self.handle_non_streaming(input, config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 242, in handle_non_streaming
return await self.handle_dynamic_batch((input, config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/batching.py", line 579, in batch_wrapper
return await enqueue_request(args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/batching.py", line 265, in _assign_func_results
results = await func_future
^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 269, in handle_dynamic_batch
batch_results = self.predictor.generate(prompts, **config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/yongqiang/llm-on-ray/llm_on_ray/inference/predictors/transformer_predictor.py", line 123, in generate
gen_tokens = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
outputs = self(
^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1305, in forward
transformer_outputs = self.transformer(
^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1119, in forward
outputs = block(
^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 616, in forward
hidden_states = self.ln_1(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 201, in forward
return F.layer_norm(
^^^^^^^^^^^^^
File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/functional.py", line 2573, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: expected scalar type BFloat16 but found Float
INFO 2024-05-14 13:40:25,988 router_PredictorDeployment dh04lq3r 442f8811-5cef-45f9-916b-e4e964ef92dc /v1/chat/completions replica.py:373 - OPENAI_CALL ERROR 751.0ms
Below is the environment:
$ nvidia-smi
Tue May 14 13:42:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:1B:00.0 Off | Off |
| 0% 37C P8 30W / 450W | 17514MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:3D:00.0 Off | Off |
| 0% 38C P8 20W / 450W | 14502MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
I use conda as the virtual environment; the Python version is:
GPT2 configuration: