opea-project / GenAIComps

GenAI components at micro-service level; GenAI service composer to create mega-service
Apache License 2.0

Ray Serve can't deploy LLM models on multiple Gaudi cards, and the Phi3 model has a CPU occupancy rate issue #98

Closed. yao531441 closed this issue 2 months ago

yao531441 commented 4 months ago

Hi, I tried using Ray Serve to deploy the Llama3 and Phi3 models, but it seems there are still some issues.

1. Ray Serve doesn't currently support multiple cards

Llama2-70B and Llama3-70B can't run on Ray Serve (see the deployment sketch below). [Screenshot 2024-05-27: deployment error]
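
For context, here is a minimal sketch of what a multi-card deployment would have to request from Ray Serve, assuming the Gaudi nodes expose an "HPU" custom resource. The deployment class, resource key, and model name are illustrative only and are not the repository's current implementation:

```python
# Hedged sketch: each Ray Serve replica reserves several HPUs so a large model
# could be sharded across cards. Assumes the cluster registers an "HPU" custom
# resource; the actual sharding logic (tensor parallelism / DeepSpeed) is omitted.
from ray import serve

@serve.deployment(num_replicas=1, ray_actor_options={"resources": {"HPU": 8}})
class ShardedLLM:
    def __init__(self, model_id: str = "meta-llama/Meta-Llama-3-70B-Instruct"):
        # Placeholder: loading and sharding the 70B checkpoint across the
        # reserved HPUs would happen here.
        self.model_id = model_id

    async def __call__(self, request) -> dict:
        return {"model": self.model_id, "status": "not implemented in this sketch"}

app = ShardedLLM.bind()
```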

2. The Phi3 model has a CPU occupancy rate issue

When using the Phi3 model, inference takes too long. CPU occupancy is very high, and we suspect the model did not actually run on Gaudi (a quick device check is sketched below).
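
One quick way to confirm that suspicion is to check which device the model parameters actually live on. The snippet below is a diagnostic sketch, assuming a Gaudi host with habana_frameworks installed; the model load itself is illustrative:

```python
# Diagnostic sketch: verify the HPU is visible and the model is really on it.
import torch
import habana_frameworks.torch.hpu as hthpu
from transformers import AutoModelForCausalLM

print("HPU available:", hthpu.is_available())
print("HPU device count:", hthpu.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("hpu")

# If this prints "cpu", inference is falling back to the host CPU,
# which would explain the high CPU occupancy and slow generation.
print("Model device:", next(model.parameters()).device)
```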

yao531441 commented 4 months ago

@tianyil1

KepingYan commented 3 months ago

As discussed with @yao531441, the parameter use_hpu_graphs is set to False by default when serving models: https://github.com/opea-project/GenAIComps/blob/4f3438215a5dfe1d748c394bd2384f00a5ba23e0/comps/llms/text-generation/ray_serve/serve.py#L323
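
For reference, this is roughly how HPU graphs are enabled when serving a model outside of Ray. It is a minimal sketch assuming the standard habana_frameworks API; the model name and generation settings are chosen for illustration:

```python
# Minimal sketch: wrap a Hugging Face causal LM in an HPU graph on Gaudi.
import torch
from habana_frameworks.torch.hpu import wrap_in_hpu_graph
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")

use_hpu_graphs = True  # the flag discussed above; serve.py currently defaults to False
if use_hpu_graphs:
    # Capturing the forward pass as an HPU graph removes per-step host launch
    # overhead, which is the usual source of the kind of speedup reported below.
    model = wrap_in_hpu_graph(model)

inputs = tokenizer("Hello from Gaudi", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```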

When using Llama2 to generate 1000 tokens, generation takes 20s when use_hpu_graphs is True and 44s when it is False. In addition, deploying the Phi3 model reports an error when use_hpu_graphs is True:

(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 716, in forward
(ServeReplica:router:Router pid=791721)     return wrapped_hpugraph_forward(
(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 594, in wrapped_hpugraph_forward
(ServeReplica:router:Router pid=791721)     outputs = orig_fwd(*args, **kwargs)
(ServeReplica:router:Router pid=791721)   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-mini-4k-instruct/ff07dc01615f8113924aed013115ab2abd32115b/modeling_phi3.py", line 1286, in forward
(ServeReplica:router:Router pid=791721)     outputs = self.model(
(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
(ServeReplica:router:Router pid=791721)     return self._call_impl(*args, **kwargs)
(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
(ServeReplica:router:Router pid=791721)     result = forward_call(*args, **kwargs)
(ServeReplica:router:Router pid=791721)   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-mini-4k-instruct/ff07dc01615f8113924aed013115ab2abd32115b/modeling_phi3.py", line 1134, in forward
(ServeReplica:router:Router pid=791721)     attention_mask = _prepare_4d_causal_attention_mask(
(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py", line 307, in _prepare_4d_causal_attention_mask
(ServeReplica:router:Router pid=791721)     attention_mask = attn_mask_converter.to_4d(
(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py", line 121, in to_4d
(ServeReplica:router:Router pid=791721)     causal_4d_mask = self._make_causal_mask(
(ServeReplica:router:Router pid=791721)   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_attn_mask_utils.py", line 169, in _make_causal_mask
(ServeReplica:router:Router pid=791721)     context_mask = 1 - torch.triu(torch.ones_like(mask, dtype=torch.int), diagonal=diagonal)
(ServeReplica:router:Router pid=791721) RuntimeError: cpu fallback is not supported during hpu graph capturing

This may be related to the note in Inference_Using_HPU_Graphs: "Only models that run completely on HPU have been tested. Models that contain CPU ops are not supported. During HPU Graphs capturing, in case the Op is not supported, the following message will appear: '… is not supported during HPU Graph capturing'."
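
Given that limitation, one pragmatic workaround is to keep HPU graphs on by default for the speedup but skip them for models whose forward pass is known to hit CPU fallback, such as Phi-3's attention-mask construction above. The helper and deny-list below are a hedged sketch, not the repository's implementation:

```python
# Sketch of a guard around HPU-graph wrapping. MODELS_WITH_CPU_OPS is an
# illustrative deny-list, not part of the actual serve.py.
from habana_frameworks.torch.hpu import wrap_in_hpu_graph

MODELS_WITH_CPU_OPS = {"microsoft/Phi-3-mini-4k-instruct"}

def maybe_wrap_in_hpu_graph(model, model_id: str, use_hpu_graphs: bool = True):
    """Wrap in an HPU graph unless the model is known to trigger CPU fallback."""
    if use_hpu_graphs and model_id not in MODELS_WITH_CPU_OPS:
        return wrap_in_hpu_graph(model)
    return model
```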

yao531441 commented 2 months ago

We support the vLLM-on-Ray solution (a rough sketch of that path is below). There is no plan to support DeepSpeed in Ray Serve.
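
For reference, the multi-card path that vLLM-on-Ray builds on looks roughly like this: vLLM shards the model across cards via tensor_parallel_size (on Gaudi this typically relies on Habana's vLLM fork). The model name and settings are placeholders:

```python
# Illustrative sketch of multi-card serving via vLLM's tensor parallelism.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=8)
params = SamplingParams(max_tokens=128, temperature=0.7)

outputs = llm.generate(["What does Ray Serve do?"], params)
print(outputs[0].outputs[0].text)
```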