hautreux / auks

Kerberos credential support for batch environments

spank-auks: mode disabled #34

Closed robberteggermont closed 4 years ago

robberteggermont commented 4 years ago

Recently a significant number of jobs failed. The common factor seems to be 'spank-auks: mode disabled' entries in the logs for these jobs. The Slurm & auks configuration was not changed at all around this time span.

I'm now wondering what chain of events could lead to the spank-auks plugin being loaded but not being able to read the configuration parameters. It seems both the loading of spank-auks and the configuration happen via the same /etc/slurm/plugstack.conf.d/auks.conf file. So spank-auks should either work, or not be loaded at all?

Is the /etc/slurm/plugstack.conf.d/auks.conf file loaded only once, or every time a new job is started? Is it loaded by slurmd or by slurmstepd (or both)?

Are there other reasons (besides the default parameter) for 'mode disabled' (such as hostname resolver problems, network connection problems, job parameter problems, ...)?

Any ideas?

hautreux commented 4 years ago

The auks.conf file is loaded both by the Slurm client commands (srun/sbatch/salloc) at submission time and by slurmstepd when a job step is started.

The conf file must enable the spank auks plugin on submission nodes as well as on compute nodes.

If submission nodes do not enable auks but compute nodes do, then the compute nodes will still try to contact the auksd daemon and will most probably fail to retrieve the user's credential.

If compute nodes do not enable auks in the conf, then you will get "mode disabled" for every job step started.
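
For reference, a minimal sketch of the corresponding SPANK configuration; the include line, the plugin path and the conf= argument are assumptions to adapt to your installation:

# /etc/slurm/plugstack.conf (on submission and compute nodes)
# pulls per-plugin files such as auks.conf into the SPANK stack
include /etc/slurm/plugstack.conf.d/*

# /etc/slurm/plugstack.conf.d/auks.conf
# <optional|required>  <plugin>  <args>
optional /usr/lib64/slurm/auks.so conf=/etc/auks/auks.conf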

Another reason to see "mode disabled" is when users disable auks at submission using "--auks=no".

The last reason to see "mode disabled" is when users do not have a valid Kerberos credential when calling srun/sbatch/salloc. In that case, spank-auks silently ignores the error and disables auks by setting the SLURM_SPANK_AUKS env var to "no" at submission. The slurmstepd will see that env var and disable the spank-auks logic when initializing/starting the tasks.

It is written in auks/src/plugins/slurm/slurm-spank-auks.c here :

/* send credential to auks daemon */
fstatus = auks_api_add_cred(&engine,NULL);

if (fstatus == AUKS_ERROR_KRB5_CRED_READ_CC) {
    if (!auks_enforced) {
        /* If no credential cache and we are not in enforced
         * mode, assume no auks support to avoid printing error
         * messages to non kerberized users */
        xinfo("cred forwarding failed : %s",
              auks_strerror(fstatus));
        xinfo("no readable credential cache : "
              "disabling auks support");
        fstatus = setenv("SLURM_SPANK_AUKS","no",0);
        if ( fstatus != 0 ) {
            xerror("unable to set SLURM_SPANK_AUKS to no");
        }
    }

Note that you can set the "enforced" option in the configuration file to modify this logic.

More information is available in the man page ("man auks.so"):

... enforced: when set, consider that a missing credential at submission will be treated as an error by the Spank stack. Since Spank plugins can be configured as optional or required, the plugin sets its internal auks status to "done" in order to allow slurmstepds to acquire a previously pushed credential if it exists and the mode is optional. When not set, auks will silently disable itself if no credential cache is available at submission time.

To summarize, the users owning the failing jobs probably do not have a valid Kerberos credential at submission time. You should set "enforced" in the auks.conf spank conf file, and probably set the spank mode to "required" instead of "optional", to make submissions fail in that case. It will help to identify where things break and to understand how/why these situations happen on the user side.
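
A sketch of that setup on the submission side, with the plugin path and the exact option spellings as assumptions to check against "man auks.so":

# /etc/slurm/plugstack.conf.d/auks.conf on the submission nodes:
# a submission without a readable ticket now fails instead of being
# silently accepted with auks disabled
required /usr/lib64/slurm/auks.so conf=/etc/auks/auks.conf enforced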

HTH

robberteggermont commented 4 years ago

Thanks for your thorough answer. I changed the auks plugin to required and enforced.

However, during my initial testing I found that when spank-auks is unable to get a ticket (e.g. when the ticket expired while the job was waiting in the queue), this triggers a 'batch job complete failure' and the node is drained. I would prefer the node not to be drained, since this is most likely just a user-side issue. Any ideas for that?

Otherwise I might be better off with optional + enforced?

hautreux commented 4 years ago

You should set the plugin to required on the login nodes and to optional on the compute nodes. Note that having the node drained may be helpful: if for some reason a node has a krb5 issue, jobs will not get their credentials and will fail, and the node will become a black hole sucking in all the pending jobs of your queue. You can check in a slurmstepd prolog whether the job still has a Kerberos ticket in auksd and hold it in the queue in that case (waiting for the user to push a new ticket to auksd), as sketched below.
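
In case it helps, here is a rough sketch of that last idea as a prolog script. It assumes SLURM_JOB_ID and SLURM_JOB_UID are present in the prolog environment, that the auks client can fetch another user's credential with "auks -g -u <uid> -C <file>", and that requeuing from a prolog fits your setup; verify those points against your Slurm and auks versions before using anything like this.

#!/usr/bin/env python3
# Rough prolog sketch (hypothetical, adapt before use): check whether the
# job owner still has a credential stored in auksd; if not, requeue the job
# in a held state so it waits for a fresh ticket instead of failing on the
# compute node. SLURM_JOB_ID/SLURM_JOB_UID and the "auks -g -u <uid> -C <file>"
# flags are assumptions to verify for your Slurm and auks versions.

import os
import subprocess
import sys
import tempfile


def user_has_auks_cred(uid):
    """Try to fetch the user's credential from auksd into a throwaway cache."""
    with tempfile.TemporaryDirectory(prefix="auks_check_") as tmpdir:
        ccache = os.path.join(tmpdir, "ccache")
        result = subprocess.run(
            ["auks", "-g", "-u", uid, "-C", ccache],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0


def main():
    jobid = os.environ.get("SLURM_JOB_ID")
    uid = os.environ.get("SLURM_JOB_UID")
    if jobid and uid and not user_has_auks_cred(uid):
        # No usable credential in auksd: requeue and hold the job; the user
        # can release it after running kinit and pushing a new ticket
        # ("auks -a").
        subprocess.run(["scontrol", "requeuehold", jobid])
    # Always exit 0 so the prolog itself does not drain the node.
    return 0


if __name__ == "__main__":
    sys.exit(main())

Whether this belongs in Prolog or PrologSlurmctld, and how requeuehold interacts with a job whose prolog is already running, is worth testing on a scratch partition first.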