hautreux / auks

Kerberos credential support for batch environments

rejecting jobs when there are no kerberos tickets #13

Closed sreedharmanchu closed 8 years ago

sreedharmanchu commented 8 years ago

Hi,

When I don't have a Kerberos credential, srun shows an error but the job still runs. How do we make the job get rejected instead?

After kdestroy, jobs still run fine with sbatch and srun, which only print an error saying that auks cred extraction failed. Is it possible to reject such jobs? If so, what needs to change in the auks configuration?

Right now this is what I have in /etc/slurm/plugstack.d/auks.conf (PlugStackConfig in slurm.conf points to it):

optional /usr/lib64/slurm/auks.so default=enabled spankstackcred=yes minimum_uid=1024 sync=no

I have this only on the submit and compute nodes. Do I need it on the front end where auksd runs?

I would really appreciate any advice.

Thanks in advance, Sreedhar.

hautreux commented 8 years ago

Just replace the 'optional' parameter with 'required' in /etc/slurm/plugstack.conf.d/auks.conf and you should be all set.

sreedharmanchu commented 8 years ago

Thank you. After writing, it occurred to me that this works like a PAM plugin, so I went and read about SPANK plugins, and it was there. I was embarrassed that it didn't occur to me at the time of writing. Either way, thank you very much. Now I see that it works, but my understanding is not fully there.

I am on a submit/login node and I destroyed my ticket with kdestroy. Then I ran srun hostname and it worked. Even srun --auks=yes hostname worked. But with --auks=no or --auks=done the job is rejected. I see in the logs that auks.so returns a failure exit code, so the plugin rejects the job.

Is it possible for the job to be rejected without mentioning --auks at all? Essentially, when I don't have a ticket, running srun hostname should fail.

Then I did a bit more testing to understand renewal and other things.

I do see that there is no ticket for me in /var/cache/auks on the management server. Yet srun hostname and srun --auks=yes worked. I am sure I'm not understanding this correctly. I am missing something, and I'd really appreciate it if you could explain a bit here.

Finally, I did kinit and I see with klist that I have a ticket issued.

As expected, my ticket had not yet been copied into /var/cache/auks on the management server.

Then I ran srun hostname and, as expected, my ticket was copied into /var/cache/auks on the management server.

I ran srun hostname, srun with --auks=yes, --auks=no and --auks=done.

In all cases, my ticket was still there in /var/cache/auks and all 4 of them ran fine.

Then I did kdestroy again on the submit node. I confirmed with klist that my ticket was destroyed.

My ticket is still there on the management server in /var/cache/auks.

I ran srun hostname, srun with --auks=yes, --auks=no and --auks=done. All of them ran fine and my ticket is still there.

This confuses me. I don't have a ticket on the submit node now, but I was able to run jobs fine. Is it because the ticket copied onto the management node has a lifetime of its own? How do lifetime and renewal time play a role here? Right now we have ticket_lifetime = 12d and renew_lifetime = 30d in our krb5.conf. Does this mean that once I run a job, I can keep running jobs for 12 (or 30?) days even when I don't have any tickets on my submit node?

Finally, as root I deleted my ticket from /var/cache/auks on the management server and restarted auksd (without the restart, jobs were still working fine), and then I was able to reproduce the behavior I mentioned at the beginning (srun and srun --auks=yes work, but --auks=no and --auks=done do not).

Finally, you mention the SLURM_SPANK_AUKS environment variable in the auks spank example configuration file. What role does it play here, and where should I set it for it to take effect?

I also see that users can pass --auks=[yes|no|done]. Ideally, we would like to reject jobs when users don't have tickets, whether or not they put --auks on the command line. Is that possible?

Please let me know. I am really sorry for asking so many questions here, but I am having a hard time making enough sense of this myself to explain it to users.

Thanks in advance. I really hope these answers will help others as well.

Best, Sreedhar.

hautreux commented 8 years ago

I think that most of the answers to your questions are in the man pages of auks, auksd, ... and of the auks spank plugin (man auks.so); you should read them again.

You have to distinguish between what the Kerberos utilities let you do and what the auks utilities (including the spank plugin) let you do.

kinit/klist/kdestroy let you manage your Kerberos credential cache: respectively initialize it, list its content, or clean it.
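As a side illustration of the Kerberos side only, here is a minimal Python sketch that checks whether the local credential cache currently holds a usable ticket. It assumes MIT Kerberos, where `klist -s` prints nothing and exits with status 0 only if the cache can be read and is not expired. The function name `has_valid_ticket` and the injectable `run` parameter are hypothetical, and a check like this on the submit node does not replace enforcement in the spank stack.

```python
import subprocess


def has_valid_ticket(run=subprocess.run) -> bool:
    """Return True if `klist -s` reports a readable, unexpired credential cache.

    Assumes MIT Kerberos semantics for `klist -s` (silent, exit status only).
    `run` is injectable so the function can be exercised without Kerberos.
    """
    try:
        result = run(["klist", "-s"], check=False)
    except FileNotFoundError:
        # klist is not installed, so there is no ticket we can see.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("valid ticket" if has_valid_ticket() else "no valid ticket")
```

After kdestroy, this would be expected to report no valid ticket even though a TGT previously pushed to auksd may still exist on the server side.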

auks and the auks spank plugin let you transfer a TGT derived from the content of your credential cache to the auks daemon (auksd), get it back as the same user or as a privileged (admin) principal, and delete it from auksd.

Once a TGT is present in auksd, it is periodically renewed by the auksdrenewer daemon for as long as renewal is possible (endtime < renew_time). Once the TGT is useless (endtime reached and endtime = renew_time), auksd removes it from its cache itself.
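The renewal rule above can be sketched in a few lines of Python. This is only an illustrative model, not the actual auksd or auksdrenewer code: `Tgt`, `status` and `renew` are hypothetical names, times are plain day counts, and two things are assumptions rather than statements from this thread: a renewal sets the new end time to min(now + lifetime, renew_time), as in standard Kerberos renewal, and any ticket past its end time is treated as removable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tgt:
    endtime: int     # day on which the ticket expires
    renew_till: int  # day after which it can no longer be renewed


def status(tgt: Tgt, now: int) -> str:
    """Classify a cached TGT as 'renew', 'keep' or 'remove' (illustrative)."""
    if now >= tgt.endtime:
        return "remove"   # endtime reached: the ticket is useless
    if tgt.endtime < tgt.renew_till:
        return "renew"    # still valid and can be extended further
    return "keep"         # still valid, but already at its renewal limit


def renew(tgt: Tgt, now: int, lifetime: int) -> Tgt:
    """Return the renewed ticket, capped at the renewal limit (assumption)."""
    return Tgt(endtime=min(now + lifetime, tgt.renew_till),
               renew_till=tgt.renew_till)


if __name__ == "__main__":
    tgt = Tgt(endtime=12, renew_till=30)  # e.g. 12-day lifetime, 30-day renewal
    for day in (10, 20):
        tgt = renew(tgt, day, lifetime=12)
        print(day, tgt, status(tgt, day))
```

Under these assumptions, a ticket pushed once with a 12-day lifetime and a 30-day renewal window could remain usable on the auksd side until day 30, regardless of what happens to the credential cache on the submit node. The effective values depend on the KDC policy, not only on krb5.conf.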

If you push a TGT to auksd using either "auks -a" or "srun --auks=yes", the privileged principal used by slurmd/slurmstepd on the compute nodes can retrieve it later, when you submit a job. If auks.so is configured as 'optional' in the spank stack, it can be retrieved even if you do not have a valid TGT in your credential cache at submission time. If you have configured auks.so as 'required', the submission will still succeed if you do not have such a valid TGT at submission time (even if one exists on the auksd side). You have to add the 'enforced' option in the spank stack to make such a submission fail (see man auks.so).
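To make the configuration change concrete, here is a small Python sketch that rewrites the plugin line quoted earlier in this thread so that it uses 'required' and carries the 'enforced' option. The helper name `harden_auks_line` is hypothetical, and writing the option as a bare `enforced` token is an assumption; the exact spelling for your version should be checked in man auks.so.

```python
def harden_auks_line(line: str) -> str:
    """Switch an auks.so spank entry to 'required' and add 'enforced'.

    Lines that are not an optional/required auks.so entry are returned as-is.
    The bare 'enforced' token is an assumption; see man auks.so.
    """
    fields = line.split()
    if (len(fields) < 2
            or fields[0] not in ("optional", "required")
            or not fields[1].endswith("auks.so")):
        return line
    fields[0] = "required"
    if "enforced" not in fields[2:]:
        fields.append("enforced")
    return " ".join(fields)


if __name__ == "__main__":
    original = ("optional /usr/lib64/slurm/auks.so default=enabled "
                "spankstackcred=yes minimum_uid=1024 sync=no")
    print(harden_auks_line(original))
```

Applied to the line from the original question, this keeps every existing option and changes only the first field and the trailing option, which matches the combination reported to work at the end of this thread.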

The SLURM_SPANK_AUKS environment variable is an alternative way to configure auks in Slurm. It stands in for the --auks=... parameter when that parameter is not properly defined on the command line.

If you want auks to be fully transparent to your users and ensure that your users have a valid TGT at submission time, you have to :

If you want to remove a TGT from the auksd cache, you can simply use the 'auks -r [-u uid]' command.

HTH, Matthieu

sreedharmanchu commented 8 years ago

Hi Matthieu,

Thank you so much for the clear explanation. Just before seeing your reply, I came across one of the man pages online and realized that what I had did not match it. I then compared versions and found that I had a somewhat old one. I upgraded to the new version, and then adding enforced together with required made things work.

Thanks again, Sreedhar.