Closed zhlsunshine closed 1 week ago
Thanks to @leslieluyu for the quick action and to @lvliang-intel for getting involved. The root cause is clear: please make sure that the SW stack version matches the firmware version. For example, if the firmware is 1.15.0, please make sure your SW stack version is 1.15.0 or 1.15.1.
However, following the gaudi-doc may upgrade your SW stack to 1.16.0 or 1.16.1 while your firmware is still 1.15.0. In that case, the log files attached above show a warning like this:
[WARNING|utils.py:198] 2024-06-20 09:20:32,266 >> optimum-habana v1.10.4 has been validated for SynapseAI v1.14.0 but the driver version is v1.16.1, this could lead to undefined behavior!
This should be the root cause of the pod failing to launch.
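To spot this mismatch quickly in a pod log, a small check like the one below may help. It is only a sketch: the function names are made up for illustration, the regular expression is written against the single warning line quoted above, and the "same major.minor means compatible" rule is an assumption generalized from the 1.15.0 / 1.15.1 example, not an official compatibility policy.

```python
import re

# Pattern written against the one warning line quoted in this thread;
# other optimum-habana versions may word the message differently.
WARNING_RE = re.compile(
    r"optimum-habana v(?P<optimum>[\d.]+) has been validated for "
    r"SynapseAI v(?P<validated>[\d.]+) but the driver version is "
    r"v(?P<driver>[\d.]+)"
)


def parse_warning(line):
    """Return the versions named in the warning line, or None if absent."""
    match = WARNING_RE.search(line)
    return match.groupdict() if match else None


def major_minor(version):
    """Split '1.15.1' into (1, 15), ignoring the patch number."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_release(a, b):
    """Assumed rule: versions are compatible if major.minor are equal."""
    return major_minor(a) == major_minor(b)


if __name__ == "__main__":
    log_line = (
        "[WARNING|utils.py:198] 2024-06-20 09:20:32,266 >> optimum-habana "
        "v1.10.4 has been validated for SynapseAI v1.14.0 but the driver "
        "version is v1.16.1, this could lead to undefined behavior!"
    )
    info = parse_warning(log_line)
    if info and not same_release(info["validated"], info["driver"]):
        print("version mismatch:", info)
```

The same `same_release` helper can be used to compare a firmware version with an installed SW stack version once both have been read from the host.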
@zhlsunshine we should definitely file a bug in GenAIExample to report this kind of issue. I guess it's related to the tgi-gaudi image version versus the host Gaudi SW stack version.
Hi @lianhao, sure. Besides that, I also found that the Gaudi SW installation doc has a problem as well: following it upgrades the SW stack automatically even when the firmware is still at a lower version.
I cannot deploy the TGI and TEI services in my Gaudi Kubernetes cluster via GMC, or even with plain kubectl apply, such as:
kubectl apply -f qna_configmap_gaudi.yaml
kubectl apply -f tei_embedding_gaudi_service.yaml
kubectl apply -f tgi_gaudi_service.yaml
I have validated this with @leslieluyu, and the result is below:
I can rule out a problem with the Gaudi node because I can successfully deploy habanalabs-gaudi-demo based on the example (with the Job changed into a Pod); I have validated this with @lianhao.
Strangely, @leslieluyu showed me that the microservices work fine on their Kubernetes Gaudi node. I am opening this issue to track the problem and still need Luyu's help on this.
tei-gaudi-embedding-svc-deployment.log tgi-gaudi-svc-deployment.log