stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Login-only node in configless mode fails to get config #82

Closed by sjpb 3 years ago

sjpb commented 3 years ago
[root@testohpc-login-0 /]# sinfo
sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
sinfo: error: fetch_config: DNS SRV lookup failed
sinfo: error: _establish_config_source: failed to fetch config
sinfo: fatal: Could not establish a configuration source

Note that sinfo on the control node works fine.

sjpb commented 3 years ago

OK, it turns out this is expected. From the configless Slurm docs:

This slurmctld can be identified by either an explicit option, or — preferably — through DNS SRV records defined within the cluster itself. If you have a login node you will be running client commands from, those client commands will have to use the DNS record to get the configuration information from the controller when they run. If you expect to have a lot of traffic from a login node, this can generate a lot of requests for the configuration files. In cases like this, you may want to consider running slurmd on the machine so it can manage the configuration files, but not allowing it to run jobs.
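To make the DNS route concrete, here is a rough sketch of checking whether a cluster publishes the SRV record that the client commands look up. The `_slurmctld._tcp` name and the zone-file entry in the comment follow the form given in the Slurm configless docs; the domain `cluster.example`, the host `slurmctl-primary` and port 6817 are placeholders, not values from this deployment.

```shell
# Hypothetical cluster DNS domain; substitute the node's real search domain.
domain="cluster.example"
srv_name="_slurmctld._tcp.${domain}"

# A matching zone-file entry, in the form shown in the Slurm docs, would be:
#   _slurmctld._tcp 3600 IN SRV 10 0 6817 slurmctl-primary

# Query the record if dig is available. The short timeout keeps the check
# from hanging when no resolver is reachable.
if command -v dig >/dev/null 2>&1; then
    dig +short +time=2 +tries=1 SRV "$srv_name" || true
fi
```

If nothing comes back, that would be consistent with the `DNS SRV lookup failed` error above, although a timeout or unreachable resolver looks the same.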

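Currently we only use the "explicit option", i.e. setting SLURMD_OPTIONS in /etc/sysconfig/slurmd. That only affects slurmd, so client commands such as sinfo on a login-only node, which does not run slurmd, fall back to the DNS SRV lookup and fail.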
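For illustration, the setting in /etc/sysconfig/slurmd looks roughly like the line below. This is a sketch, not the role's actual template: `--conf-server` is the slurmd option described in the configless docs, while the host `slurmctl-primary` and port 6817 are placeholders.

```shell
# /etc/sysconfig/slurmd (sketch; host and port are placeholders)
# Tells slurmd where to fetch its configuration from in configless mode.
SLURMD_OPTIONS="--conf-server slurmctl-primary:6817"
```

Running slurmd with this option on a login node, without letting it run jobs, is the workaround the docs suggest for nodes that issue many client commands.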