intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

expected scalar type BFloat16 but found Float #223

Open darmenliu opened 5 months ago

darmenliu commented 5 months ago

Hi, I hit the issue below when trying to serve GPT2 as described in the guide. Could anyone help me check whether this is caused by a configuration error?

$ python examples/inference/api_server_openai/query_http_requests.py
chunk content:  {"generated_text":null,"tool_calls":null,"num_input_tokens":null,"num_input_tokens_batch":null,"num_generated_tokens":null,"num_generated_tokens_batch":null,"preprocessing_time":null,"generation_time":null,"timestamp":1715708425.9959323,"finish_reason":null,"error":{"object":"error","message":"Internal Server Error","internal_message":"Internal Server Error","type":"InternalServerError","param":{},"code":500}}
Traceback (most recent call last):
  File "/home/rcp_user/yongqiang/llm-on-ray/examples/inference/api_server_openai/query_http_requests.py", line 90, in <module>
    raise e
  File "/home/rcp_user/yongqiang/llm-on-ray/examples/inference/api_server_openai/query_http_requests.py", line 85, in <module>
    choices = json.loads(chunk)["choices"]
              ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'
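
(The KeyError itself looks like just a symptom: the chunk carries an "error" object instead of "choices". Below is a small defensive helper, sketched against the same JSON chunk format and not part of the repo's example script, that surfaces the server error directly instead of failing on the missing key.)

import json

def extract_choices(chunk: str):
    # Hypothetical helper: parse one streamed chunk and raise the
    # server-side error payload explicitly instead of hitting
    # KeyError: 'choices' when the request failed.
    data = json.loads(chunk)
    if data.get("error"):
        raise RuntimeError(f"server returned an error: {data['error']}")
    return data["choices"]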

I checked the logs with "ray logs cluster":

$ ray logs cluster worker-f07bb5a711a3c88ae7720f125167d0d2a7e64799231f33910f71ac72-01000000-1624109.err
--- Log has been truncated to last 1000 lines. Use `--tail` flag to toggle. Set to -1 for getting the entire file. ---

:job_id:01000000
:actor_name:ServeReplica:router:PredictorDeployment
/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
2024-05-14 13:39:51,629 - _logger.py - IPEX - WARNING - [NotSupported]fail to apply ipex.llm.optimize due to: Could not run 'ipex_prepack::linear_prepack' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'ipex_prepack::linear_prepack' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

CPU: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/jit/cpu/kernels/RegisterOpContextClass.cpp:192 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:154 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:324 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:86 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:53 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:57 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:65 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:69 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:77 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:61 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:90 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:73 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:81 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:297 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:378 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:244 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:202 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:162 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:166 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:158 [backend fallback]
, fallback to the origin model
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
ERROR 2024-05-14 13:40:25,987 router_PredictorDeployment dh04lq3r 442f8811-5cef-45f9-916b-e4e964ef92dc /v1/chat/completions replica.py:359 - Request failed:
ray::ServeReplica:router:PredictorDeployment.handle_request_with_rejection() (pid=1624109, ip=10.97.102.172)
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/utils.py", line 168, in wrap_to_ray_error
    raise exception
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1131, in call_user_method
    result = await self._handle_user_method_result(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/_private/replica.py", line 1038, in _handle_user_method_result
    async for r in result:
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 444, in openai_call
    yield await self.handle_non_streaming(input, config)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 242, in handle_non_streaming
    return await self.handle_dynamic_batch((input, config))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/batching.py", line 579, in batch_wrapper
    return await enqueue_request(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/ray/serve/batching.py", line 265, in _assign_func_results
    results = await func_future
              ^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/llm_on_ray/inference/predictor_deployment.py", line 269, in handle_dynamic_batch
    batch_results = self.predictor.generate(prompts, **config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/yongqiang/llm-on-ray/llm_on_ray/inference/predictors/transformer_predictor.py", line 123, in generate
    gen_tokens = self.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/generation/utils.py", line 2791, in _sample
    outputs = self(
              ^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1305, in forward
    transformer_outputs = self.transformer(
                          ^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1119, in forward
    outputs = block(
              ^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 616, in forward
    hidden_states = self.ln_1(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/modules/normalization.py", line 201, in forward
    return F.layer_norm(
           ^^^^^^^^^^^^^
  File "/home/rcp_user/anaconda3/envs/ray/lib/python3.11/site-packages/torch/nn/functional.py", line 2573, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: expected scalar type BFloat16 but found Float
INFO 2024-05-14 13:40:25,988 router_PredictorDeployment dh04lq3r 442f8811-5cef-45f9-916b-e4e964ef92dc /v1/chat/completions replica.py:373 - OPENAI_CALL ERROR 751.0ms

Below is the environment:

$ nvidia-smi
Tue May 14 13:42:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:1B:00.0 Off |                  Off |
|  0%   37C    P8              30W / 450W |  17514MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:3D:00.0 Off |                  Off |
|  0%   38C    P8              20W / 450W |  14502MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |

I use conda as the virtual environment; the Python version is:

$ python --version
Python 3.11.9

GPT2 configuration:

$ cat llm_on_ray/inference/models/gpt2.yaml
port: 8000
name: gpt2
route_prefix: /gpt2
num_replicas: 1
cpus_per_worker: 0
gpus_per_worker: 1
deepspeed: false
workers_per_group: 2
device: cuda
ipex:
  enabled: true
  precision: bf16
model_description:
  model_id_or_path: gpt2
  tokenizer_name_or_path: gpt2
  chat_processor: ChatModelGptJ
  gpt_base_model: true
  prompt:
    intro: ''
    human_id: ''
    bot_id: ''
    stop_words: []
xwu99 commented 5 months ago

Thanks for raising the issue. I think you need to set ipex enabled to false, since IPEX only works on Intel devices.
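
For reference, a minimal sketch of that change in the ipex block of llm_on_ray/inference/models/gpt2.yaml (only the enabled flag changes; the rest of the file stays as posted above):

ipex:
  enabled: false   # disable IPEX when device is cuda; per the reply above, IPEX only works on Intel devices
  precision: bf16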