Open Hexoplon opened 3 weeks ago
Please provide some extension logs for us to understand the problem better, thank you!
@smira The log is not very helpful; it just repeats this one line:
<node ip>: fabric manager config file item: FABRIC_NODE_CONFIG_FILE has an invalid value
This is with the default fabric manager config file, as included in the extension. I can't really tell what would be wrong, unless there is an issue with one of the paths and the error message is just misleading.
Very timely report, I am also running into this issue with an 8-GPU H100 system. We are trying to convert this system to run Talos; for comparison, everything works in Ubuntu with a standard fabricmanager config.
I tried the topology fix before I saw your issue, but that didn't change anything either. I've been testing several variations of the option syntax to see if I can narrow down the issue, but no luck yet. I can offer more help to debug.
I was able to get the nvidia-fabricmanager-lts version working by manually rebuilding the extension to include the topology fix and running with the related lts nvidia-container-toolkit and nvidia-open-gpu-kernel-modules. This would point the bug more toward nvidia-fabricmanager. Perhaps it would be worth filing with NVIDIA.
I also tried re-building fabricmanager, container-toolkit and open-gpu-kernel modules with the next fabricmanager version 550.127.05 and the same fabricmanager error still occurred.
Yeah, I'm gonna try with the 550.127.05 driver and fabric manager versions as well, but if it does not work for you it will probably not work here either. I am not able to use the LTS versions of the driver release either, as the H200 cards are too new for that driver. Will have to see if I can get it working in some other way.
@askedrelic Still no luck on my part. Have you enabled any specific kernel modules, which are not listed in the documentation, to get it to work?
Hey @Hexoplon I stopped investigating since the nvidia-fabricmanager-lts version was able to unblock us. I would like to use latest, but can wait until someone from Talos takes a look.
For more data, our working Ubuntu server is 22.04 with kernel 5.15.0-126-generic and nvidia drivers/fabricmanager 550.127.05.
I was thinking, is there a way to get a shell inside the container trying to run fabricmanager? Being able to debug in real time would be quicker than having to re-build the extensions and re-install every time. It's still hard for me to see if this is a Talos permissions issue or an NVIDIA issue.
I've also got it working just fine on a RHEL 8 server. Even after modifying the config file to match the replacements that Talos is performing in the extension, it runs just fine there. When I diff the Talos config file with the one from the RHEL server, they are identical. Yet, somehow fabric manager on Talos says the config file is invalid.
Perhaps the issue is not directly related to the config file, but rather a missing host mount or perhaps missing kernel modules.
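For anyone repeating the comparison above, here is a minimal Python sketch of a key-by-key diff that ignores comments and whitespace, plus a helper that lists the path-valued entries so they can be checked against the host mounts. It assumes the fabric manager config uses plain KEY=VALUE lines with # comments; the function names (parse_cfg, diff_cfg, path_values) are hypothetical helpers, not part of Talos or fabricmanager.

```python
def parse_cfg(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments (assumed format)."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            cfg[key.strip()] = value.strip()
    return cfg


def diff_cfg(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


def path_values(cfg):
    """Return the entries whose value looks like an absolute path."""
    return {k: v for k, v in cfg.items() if v.startswith("/")}
```

An empty result from diff_cfg would confirm that the two files are equivalent; each path returned by path_values is then a candidate for a missing mount inside the extension container.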
We'll try to look into this and fix before the 1.9 release
@frezbo fantastic! Let me know if I can be of any assistance in testing
Actually, if you could post the output of talosctl logs syslogd it could be helpful; fabricmanager logs to syslog.
@frezbo the only entry in the syslogd log is from nvidia-persistenced:
<node ip>: {"content":"nvidia-persistenced: Started (14)","facility":3,"hostname":"localhost","priority":29,"severity":5,"tag":"unknown","timestamp":"2024-11-19T20:57:15Z"}
(I mean the only entry; that is the only line I get from talosctl logs syslogd)
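In case it helps with longer logs, here is a minimal Python sketch for filtering saved talosctl logs syslogd output. It assumes every line has the shape shown above (a node prefix followed by a JSON object with content and tag fields); parse_syslog_line and entries_mentioning are hypothetical helper names.

```python
import json


def parse_syslog_line(line):
    """Split '<node>: {json}' into (node, record); return None if it does not parse."""
    start = line.find("{")
    if start < 0:
        return None
    node = line[:start].rstrip(": ")
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    return node, record


def entries_mentioning(lines, needle):
    """Return parsed entries whose content or tag contains needle (case-insensitive)."""
    needle = needle.lower()
    hits = []
    for line in lines:
        parsed = parse_syslog_line(line)
        if parsed is None:
            continue
        _, record = parsed
        text = f"{record.get('content', '')} {record.get('tag', '')}".lower()
        if needle in text:
            hits.append(parsed)
    return hits
```

Calling entries_mentioning(lines, "fabricmanager") on the output above returns nothing, which matches the observation that only nvidia-persistenced has written to syslog.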
Using the default config and the Nvidia production drivers, the fabric manager extension keeps crashing. It exits with an error complaining that FABRIC_NODE_CONFIG_FILE is invalid.
What I've tried:
Tested on an HPE Cray XD670 with 8xH200.
I also noticed that topology files for H100 and H200 cards are not added to the fabric manager extension; I will add a separate PR to address this.