stackhpc / ansible-role-openhpc

Ansible role for OpenHPC
Apache License 2.0

Errors in logfile due to login node config #115

Closed sjpb closed 2 years ago

sjpb commented 2 years ago

The slurmctld logfile shows errors like:

Sep 17 09:29:28 alaska-control slurmctld[208988]: error: _slurm_rpc_node_registration node=alaska-login-0: Invalid argument

This is because partitions define a default node with details, e.g.:

NodeName=DEFAULT State=UNKNOWN \
    RealMemory=106897 \
    Sockets=2 \
    CoresPerSocket=15 \
    ThreadsPerCore=2

but we don't write a new DEFAULT for login nodes, so they inherit the values from the last compute partition's DEFAULT. If the login nodes don't match those values, there is a mismatch on registration.
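
A minimal sketch of the ordering that triggers this, assuming a single compute partition. The compute node names and the partition name are hypothetical; only alaska-login-0 and the DEFAULT values come from the log and config above:

```
# COMPUTE PARTITION (hypothetical names)
# NodeName=DEFAULT applies to every NodeName line that follows it.
NodeName=DEFAULT State=UNKNOWN RealMemory=106897 Sockets=2 CoresPerSocket=15 ThreadsPerCore=2
NodeName=alaska-compute-[0-1]
PartitionName=compute Nodes=alaska-compute-[0-1] Default=YES

# LOGIN-ONLY NODES
# No new DEFAULT is written here, so this node inherits the compute values above.
NodeName=alaska-login-0
```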

This can't be fixed by adding a NodeName=DEFAULT before the login node definition.

It can be fixed by putting the login-node definitions BEFORE the first DEFAULT definition. Suggest:

# LOGIN-ONLY NODES
# Define slurmd nodes not in partitions for configless login-only nodes:
<templating>
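
As an illustration only, the rendered slurm.conf might then look something like the sketch below. The compute node names and the partition name are hypothetical, and the real output depends on the role's templating:

```
# LOGIN-ONLY NODES
# Define slurmd nodes not in partitions for configless login-only nodes:
# Defined before any NodeName=DEFAULT line, so no compute values are inherited.
NodeName=alaska-login-0

# COMPUTE PARTITION (hypothetical names)
NodeName=DEFAULT State=UNKNOWN RealMemory=106897 Sockets=2 CoresPerSocket=15 ThreadsPerCore=2
NodeName=alaska-compute-[0-1]
PartitionName=compute Nodes=alaska-compute-[0-1] Default=YES
```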