lianhao closed this issue 2 months ago
This error generally means that the devices cannot be recognized. From the hl-smi output, it also seems that you do not have any process running. Can you try the latest image, please? Otherwise, can you provide the steps you are following so that I can reproduce the issue?
The latest upstream image, which was released recently, seems to work on Gaudi SW v1.16.x. Maybe we should advise users to use the upstream image instead of opea/tei-gaudi? I believe we created opea/tei-gaudi because at that time there was no published upstream image.
Yes, I also ran it with the latest image on 1.16. This image is specifically for the 1.16.x release: https://github.com/huggingface/tei-gaudi/pkgs/container/tei-gaudi/241185933?tag=synapse_1.16. And yes, maybe we can include validated configurations for the examples.
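To make the "validated configurations" idea concrete, here is a minimal sketch of a lookup from Gaudi SW version to a validated image. Only the 1.16 entry comes from this thread; the `ghcr.io/huggingface/tei-gaudi` registry path is an assumption inferred from the package URL above, and `VALIDATED_IMAGES` and `image_for_sw_version` are hypothetical names used for illustration.

```python
# Hypothetical table of validated configurations.
# Only the 1.16 entry is taken from this thread; the ghcr.io registry path
# is an assumption inferred from the GitHub package URL.
VALIDATED_IMAGES = {
    "1.16": "ghcr.io/huggingface/tei-gaudi:synapse_1.16",
}


def image_for_sw_version(sw_version: str) -> str:
    """Return the validated image for a Gaudi SW version such as 'v1.16.2'."""
    parts = sw_version.lstrip("v").split(".")
    if len(parts) < 2:
        raise ValueError(f"expected a version like '1.16.x', got {sw_version!r}")
    # Validation is tracked per major.minor release, so drop the patch level.
    key = ".".join(parts[:2])
    if key not in VALIDATED_IMAGES:
        raise LookupError(f"no validated tei-gaudi image for Gaudi SW {sw_version}")
    return VALIDATED_IMAGES[key]


if __name__ == "__main__":
    print(image_for_sw_version("v1.16.2"))
```

Failing loudly on an unlisted SW version, rather than falling back to a default tag, would surface a mismatch like the one reported here before the pod is launched.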
Using the image opea/tei-gaudi:v0.7 on a Gaudi-enabled k8s cluster doesn't work. When the pod is launched, it fails with the following error message during startup:
We're using the model BAAI/bge-base-en-v1.5.
However, if I manually build the opea/tei-gaudi image based on the tei-gaudi tag synapse_1.16, it seems to work.
We should release a new opea/tei-gaudi image.
My test environment is:
Host Environment: Ubuntu 22.04 with kernel 5.15.0-92-generic
K8S ver: v1.29.5
containerd ver: 1.7.19
Gaudi SW stack: