siderolabs / extensions

Talos Linux System Extensions
Mozilla Public License 2.0

Fabric Manager unable to start #511

Open Hexoplon opened 3 weeks ago

Hexoplon commented 3 weeks ago

Using the default config and the NVIDIA production drivers, the fabric manager extension keeps crashing. It exits with an error complaining that FABRIC_NODE_CONFIG_FILE is invalid.

What I've tried:

Tested on an HPE Cray XD670 with 8xH200.

I also noticed that topology files for H100 and H200 cards are not added to the fabric manager extension. I will open a separate PR to address this.

smira commented 3 weeks ago

Please provide some extension logs for us to understand the problem better, thank you!

Hexoplon commented 3 weeks ago

@smira The log is not very helpful; it just repeats this one line:

<node ip>: fabric manager config file item: FABRIC_NODE_CONFIG_FILE has an invalid value.

This is with the default fabric manager config file, as included in the extension. I can't really tell what would be wrong, unless there is an issue with one of the paths and the error message is just misleading.
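One way to test the "issue with one of the paths" theory is a small script along these lines. This is only a sketch under assumptions: it treats the config as plain KEY=VALUE lines with # comments, and parse_config, missing_paths and the root parameter are hypothetical names made up for illustration, not part of fabric manager or Talos. It lists every absolute path in the config for which neither the file nor its parent directory exists under a given root.

```python
import os


def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def missing_paths(entries, root="/"):
    """Return {key: value} for absolute-path values that cannot be resolved under root.

    A value is reported only if neither the path itself nor its parent
    directory exists, because log or state files may legitimately not
    exist before the first start.
    """
    problems = {}
    for key, value in entries.items():
        if not value.startswith("/"):
            continue  # not an absolute path, nothing to check
        target = os.path.join(root, value.lstrip("/"))
        parent = os.path.dirname(target)
        if not os.path.exists(target) and not os.path.isdir(parent):
            problems[key] = value
    return problems
```

Pointing root at an unpacked copy of the extension's root filesystem (again an assumption about how one would get at the files) would show whether any configured path is simply absent there, which a plain diff of two config files cannot reveal.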

askedrelic commented 3 weeks ago

Very timely report, I am also running into this issue with an 8-GPU H100 system. We are trying to convert this system to run Talos, and for comparison, everything works in Ubuntu with a standard fabricmanager config.

I tried the topology fix before I saw your issue, but that didn't change anything either. I've been testing several variations of the option syntax to see if I can narrow down the issue, but no luck yet. I can offer more help to debug.

askedrelic commented 2 weeks ago

I was able to get the nvidia-fabricmanager-lts version working by manually rebuilding the extension to include the topology fix and running it with the matching LTS nvidia-container-toolkit and nvidia-open-gpu-kernel-modules. This would point the bug more toward nvidia-fabricmanager. Perhaps it would be worth filing an issue with NVIDIA.

I also tried rebuilding fabricmanager, container-toolkit and open-gpu-kernel-modules with the next fabricmanager version, 550.127.05, and the same fabricmanager error still occurred.

Hexoplon commented 2 weeks ago

Yeah, I'm gonna try with the 550.127.05 driver and fabric manager versions as well, but if it does not work for you it will probably not work here either. I am not able to use the LTS versions of the driver release either, as the H200 cards are too new for that driver. Will have to see if I can get it working in some other way.

Hexoplon commented 1 week ago

@askedrelic Still no luck on my part. Did you enable any kernel modules that are not listed in the documentation to get it to work?

askedrelic commented 6 days ago

Hey @Hexoplon I stopped investigating since the nvidia-fabricmanager-lts version was able to unblock us. I would like to use latest, but can wait until someone from Talos takes a look.

For more data, our working Ubuntu server is 22.04 with kernel 5.15.0-126-generic and nvidia drivers/fabricmanager 550.127.05.

I was thinking, is there a way to get a shell inside the container trying to run fabricmanager? Being able to debug in real time would be quicker than having to rebuild the extensions and reinstall every time. It's still hard for me to see whether this is a Talos permissions issue or an NVIDIA issue.

Hexoplon commented 6 days ago

I've also got it working just fine on a RHEL 8 server. Even after modifying the config file to match the replacements that Talos performs in the extension, it runs just fine there. When I diff the Talos config file against the one from the RHEL server, they are identical. Yet somehow fabric manager on Talos says the config file is invalid.

Perhaps the issue is not directly related to the config file, but rather a missing host mount or perhaps missing kernel modules.

frezbo commented 6 days ago

We'll try to look into this and fix before the 1.9 release

Hexoplon commented 6 days ago

@frezbo fantastic! Let me know if I can be of any assistance in testing

frezbo commented 5 days ago

> @frezbo fantastic! Let me know if I can be of any assistance in testing

Actually, if you could post the output of talosctl logs syslogd, it could be helpful; fabricmanager logs to syslog.

Hexoplon commented 4 days ago

@frezbo the only entry in the syslogd log is from nvidia-persistenced:
<node ip>: {"content":"nvidia-persistenced: Started (14)","facility":3,"hostname":"localhost","priority":29,"severity":5,"tag":"unknown","timestamp":"2024-11-19T20:57:15Z"}

(I mean that literally: it is the only line I get from talosctl logs syslogd.)
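For reading entries like the one above, here is a minimal sketch that decodes the numeric fields. It assumes the line has the shape shown in this thread (a node prefix followed by a JSON object) and that priority follows the usual syslog encoding of facility * 8 + severity, with names taken from RFC 5424. decode_entry is a hypothetical helper, not something talosctl provides.

```python
import json

# Severity names indexed by number, and the first few facility names (RFC 5424).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
FACILITIES = {0: "kern", 1: "user", 2: "mail", 3: "daemon", 4: "auth", 5: "syslog"}


def decode_entry(line):
    """Decode one '<node>: {json}' syslogd line into readable fields."""
    # Everything from the first '{' onward is the JSON payload.
    payload = json.loads(line[line.index("{"):])
    facility, severity = divmod(payload["priority"], 8)
    return {
        "content": payload["content"],
        "facility": FACILITIES.get(facility, str(facility)),
        "severity": SEVERITIES[severity],
        # True when the explicit fields agree with the packed priority value.
        "consistent": (facility, severity)
        == (payload["facility"], payload["severity"]),
    }
```

For the entry quoted above, priority 29 splits into facility 3 (daemon) and severity 5 (notice), since 3 * 8 + 5 = 29. So the single line is an ordinary notice-level startup message from nvidia-persistenced, and nothing in it comes from fabric manager.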