Closed yao531441 closed 2 months ago
@tianyil1
As discussed with @yao531441, the parameter use_hpu_graphs is set to False by default when serving models. https://github.com/opea-project/GenAIComps/blob/4f3438215a5dfe1d748c394bd2384f00a5ba23e0/comps/llms/text-generation/ray_serve/serve.py#L323
When using Llama2 to generate 1000 tokens, it takes 20s when use_hpu_graphs is True and 44s when it is False. In addition, deploying the Phi3 model reports an error when use_hpu_graphs is True:
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
(ServeReplica:router:Router pid=791721) return wrapped_hpugraph_forward(
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 594, in wrapped_hpugraph_forward
(ServeReplica:router:Router pid=791721) outputs = orig_fwd(*args, **kwargs)
(ServeReplica:router:Router pid=791721) File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-mini-4k-instruct/ff07dc01615f8113924aed013115ab2abd32115b/modeling_phi3.py", line 1286, in forward
(ServeReplica:router:Router pid=791721) outputs = self.model(
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
(ServeReplica:router:Router pid=791721) return self._call_impl(*args, **kwargs)
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
(ServeReplica:router:Router pid=791721) result = forward_call(*args, **kwargs)
(ServeReplica:router:Router pid=791721) File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-mini-4k-instruct/ff07dc01615f8113924aed013115ab2abd32115b/modeling_phi3.py", line 1134, in forward
(ServeReplica:router:Router pid=791721) attention_mask = _prepare_4d_causal_attention_mask(
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py", line 307, in _prepare_4d_causal_attention_mask
(ServeReplica:router:Router pid=791721) attention_mask = attn_mask_converter.to_4d(
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py", line 121, in to_4d
(ServeReplica:router:Router pid=791721) causal_4d_mask = self._make_causal_mask(
(ServeReplica:router:Router pid=791721) File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py", line 169, in _make_causal_mask
(ServeReplica:router:Router pid=791721) context_mask = 1 - torch.triu(torch.ones_like(mask, dtype=torch.int), diagonal=diagonal)
(ServeReplica:router:Router pid=791721) RuntimeError: cpu fallback is not supported during hpu graph capturing
This may be related to the limitation described in Inference_Using_HPU_Graphs: "Only models that run completely on HPU have been tested. Models that contain CPU ops are not supported. During HPU Graphs capturing, in case the Op is not supported, the following message will appear: '… is not supported during HPU Graph capturing'".
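The op that falls back to CPU in the traceback is the integer-dtype `torch.triu` used to build the causal mask in `modeling_phi3.py`. For reference, the same mask can be expressed with a single `torch.tril` call; the sketch below only demonstrates the algebraic equivalence on CPU, and whether the rewritten form avoids the HPU Graph capture error is an assumption, not something tested on Gaudi:

```python
import torch

# Mask construction from transformers' modeling_attn_mask_utils.py that
# triggers the CPU fallback during HPU Graph capturing:
mask = torch.zeros(5, 5)
diagonal = 1
original = 1 - torch.triu(torch.ones_like(mask, dtype=torch.int), diagonal=diagonal)

# Equivalent form: "everything strictly below `diagonal`" is exactly
# tril with the diagonal shifted down by one.
rewritten = torch.tril(torch.ones_like(mask, dtype=torch.int), diagonal=diagonal - 1)

assert torch.equal(original, rewritten)
print("masks match")
```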
We support the vllm-on-ray solution. There is no plan to support DeepSpeed in Ray Serve.
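For anyone trying the vllm-on-ray path, a minimal client sketch against vLLM's OpenAI-compatible completions endpoint might look like the following; the host, port, and model name are placeholders for your own deployment, and the request line is left commented out since it needs a running server:

```python
import json
from urllib import request

# Placeholder endpoint: adjust host/port to where the vLLM server is exposed.
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "meta-llama/Llama-2-7b-hf",  # placeholder model name
    "prompt": "Hello",
    "max_tokens": 32,
}
req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)           # uncomment against a running server
# print(json.load(resp)["choices"][0])  # completion text is under "choices"
```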
Hi, I tried using Ray Serve to deploy the Llama3 and Phi3 models, but there still seem to be some issues.
1. Ray Serve doesn't support multiple cards now
Llama2-70b and Llama3-70b can't run on Ray Serve.
2. Phi3 models have a CPU occupancy issue
When using the Phi3 model, inference takes too long. CPU occupancy is very high, so we suspect the model is not actually running on Gaudi.
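One quick way to check the "not actually on Gaudi" suspicion is to inspect the device of the model's parameters after moving it. This is a generic sketch (the helper name is mine, not from serve.py), demonstrated here on CPU; inside the Ray Serve replica you would call it with `"hpu"` after `model.to("hpu")`:

```python
import torch

def assert_on_device(model: torch.nn.Module, expected: str) -> torch.device:
    """Raise if the model's parameters are not on the expected device type."""
    device = next(model.parameters()).device
    if device.type != expected:
        raise RuntimeError(f"model is on {device}, expected {expected}")
    return device

# Demonstrated with a toy module on CPU; on Gaudi, use expected="hpu".
m = torch.nn.Linear(4, 4)
print(assert_on_device(m, "cpu").type)  # -> cpu
```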