Closed: wsfowler closed this 4 months ago
@wsfowler,
Thank you for raising the issues. We're currently in the process of actively refactoring the GenAIExamples to adhere to a microservice-based architecture. Please refer to the latest version of the README for updated instructions.
Setting HABANA_VISIBLE_DEVICES to "all" signifies that the system will allocate any available HPU device to the service. If you encounter a "Device acquire failed" error, it indicates that there are no free HPU devices available in the system.
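As a quick check from the host (a sketch; the CSV query flags are assumed from Habana's documentation and may differ by driver version, so verify with `hl-smi -h`), you can inspect per-device utilization and memory to see whether any HPU is already claimed:

```shell
# Full table: a device listed with active compute processes or high
# memory usage is not free for a new container to acquire.
hl-smi

# CSV query, similar in spirit to nvidia-smi (flag names assumed):
hl-smi -Q index,utilization.aip,memory.used -f csv
```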
@lvliang-intel
Understood on the refactoring; I'll retest as things get updated. I did find another issue after some of the refactoring: #153
Also, on the HPU device error, how would I go about troubleshooting it? I can load the Habana PyTorch container, run hl-smi, and see the cards, but when I try to run it in the opea/tei-gaudi container I get an error about the driver not being loaded. I get the following if I run hl-smi on the host:
root@ip-172-31-88-161:/opt/GenAIExamples/ChatQnA/microservice/gaudi# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.15.1-fw-49.0.0.0 |
| Driver Version: 1.15.1-62f612b |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-205 N/A | 0000:10:1d.0 N/A | 0 |
| N/A 46C N/A 101W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-205 N/A | 0000:90:1d.0 N/A | 0 |
| N/A 48C N/A 99W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-205 N/A | 0000:90:1e.0 N/A | 0 |
| N/A 49C N/A 100W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-205 N/A | 0000:a0:1d.0 N/A | 0 |
| N/A 47C N/A 108W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-205 N/A | 0000:a0:1e.0 N/A | 0 |
| N/A 46C N/A 100W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-205 N/A | 0000:10:1e.0 N/A | 0 |
| N/A 47C N/A 98W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-205 N/A | 0000:20:1e.0 N/A | 0 |
| N/A 47C N/A 103W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-205 N/A | 0000:20:1d.0 N/A | 0 |
| N/A 48C N/A 102W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
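One way to narrow down the in-container driver error (a sketch, assuming the habana-container-runtime is installed and that the opea/tei-gaudi image includes hl-smi; adjust the image name and flags to your setup) is to run hl-smi inside the failing image with the Habana runtime explicitly selected:

```shell
# Run hl-smi inside the tei-gaudi image using the Habana container runtime.
# If this fails while hl-smi works on the host, the runtime is likely not
# wired into Docker (check /etc/docker/daemon.json for a "habana" runtime).
docker run --rm --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  --cap-add=sys_nice --ipc=host \
  opea/tei-gaudi hl-smi
```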
I'm trying to get the ChatQnA Gaudi Example to work and I'm running into a few issues.
First, in the docker_compose.yaml file, both the tei_embedding_service and the tgi_service have HABANA_VISIBLE_DEVICES set to all. I'm not sure this is the correct setting. Should it be changed? Shouldn't each service specify which cards it will try to allocate? The error message I get from these containers is:
If I specify the specific cards to allocate to each container then I get past these errors.
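For reference, pinning specific cards per service looks roughly like this in docker_compose.yaml (a sketch only; the service names follow the ones above, and the 0-3 / 4-7 split is an arbitrary example, not a recommendation):

```yaml
services:
  tei_embedding_service:
    environment:
      HABANA_VISIBLE_DEVICES: "0,1,2,3"   # pin this service to HPUs 0-3
    runtime: habana
  tgi_service:
    environment:
      HABANA_VISIBLE_DEVICES: "4,5,6,7"   # pin this service to HPUs 4-7
    runtime: habana
```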
Second, for the opea/gen-ai-comps:reranking-tei-server container I'm getting the following error:

Third, for the ghcr.io/huggingface/tgi-gaudi:1.2.1 container, after modifying the docker_compose.yaml file to not use the all value for HABANA_VISIBLE_DEVICES, I get the following error:

Fourth, for the opea/tei-gaudi container I get the following error: