opea-project / GenAIInfra

Containerization and cloud native suite for OPEA
Apache License 2.0

Can not deploy TGI and TEI services in my Gaudi Kubernetes Cluster based on manifest Yaml files #112

Closed zhlsunshine closed 1 week ago

zhlsunshine commented 1 week ago

I cannot deploy the TGI and TEI services in my Gaudi Kubernetes cluster via GMC, nor with plain kubectl apply, for example:

    kubectl apply -f qna_configmap_gaudi.yaml
    kubectl apply -f tei_embedding_gaudi_service.yaml
    kubectl apply -f tgi_gaudi_service.yaml

I have validated this with @leslieluyu, and the result is below:

image

I can rule out a problem with the Gaudi node itself, because I can successfully deploy habanalabs-gaudi-demo based on the example (with the Job changed into a Pod). I have validated this with @lianhao.

image

Strangely, @leslieluyu showed me that the microservices work fine on their Kubernetes Gaudi node. I am opening this issue to track the problem, and I still need Luyu's help with it.

tei-gaudi-embedding-svc-deployment.log tgi-gaudi-svc-deployment.log

zhlsunshine commented 1 week ago

Thanks to @leslieluyu's quick action and @lvliang-intel's involvement, the root cause is clear: please make sure that the SW stack versions match the firmware. For example, if the firmware is 1.15.0, make sure your SW stack versions are 1.15.0 or 1.15.1. However, following the gaudi-doc may upgrade your SW stack to 1.16.0 or 1.16.1 while your firmware is still 1.15.0. In that case, the following error message appears, as seen in the log files attached above:

[WARNING|utils.py:198] 2024-06-20 09:20:32,266 >> optimum-habana v1.10.4 has been validated for SynapseAI v1.14.0 but the driver version is v1.16.1, this could lead to undefined behavior!

This should be the root cause of the pod failing to launch!
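
To check for this kind of mismatch, here is a minimal Python sketch that pulls the version numbers out of the warning line quoted above and compares two versions by their major.minor release. The function names and the "same major.minor" rule are assumptions for illustration, inferred only from the example in this thread (firmware 1.15.0 with SW stack 1.15.0 or 1.15.1). They are not an official compatibility policy.

```python
import re

# Matches the optimum-habana warning quoted in this thread.
# The pattern is written against that single log line; other
# releases may word the message differently (assumption).
WARNING_RE = re.compile(
    r"optimum-habana v(?P<optimum>[\d.]+) has been validated for "
    r"SynapseAI v(?P<validated>[\d.]+) "
    r"but the driver version is v(?P<driver>[\d.]+)"
)


def parse_version_warning(line):
    """Return the versions named in the warning, or None if no match."""
    match = WARNING_RE.search(line)
    return match.groupdict() if match else None


def same_minor_release(a, b):
    """Hypothetical rule: versions are compatible if major.minor agree."""
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    log_line = (
        "[WARNING|utils.py:198] 2024-06-20 09:20:32,266 >> optimum-habana "
        "v1.10.4 has been validated for SynapseAI v1.14.0 but the driver "
        "version is v1.16.1, this could lead to undefined behavior!"
    )
    versions = parse_version_warning(log_line)
    print(versions)
    # Firmware version as reported in this thread.
    firmware = "1.15.0"
    print("driver matches firmware:",
          same_minor_release(firmware, versions["driver"]))
```

Running this against the pod logs before digging further can show quickly whether the host driver has moved ahead of the firmware or of the version the container image was validated for.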

lianhao commented 1 week ago

@zhlsunshine we should definitely file a bug in GenAIExample to report this kind of issue. I guess it's related to the tgi-gaudi image version versus the host Gaudi SW stack version.

zhlsunshine commented 1 week ago

Hi @lianhao, sure. Besides that, I also found that the Gaudi SW installation doc has a problem as well: following it upgrades the SW stack automatically even though the firmware is still at a lower version.