Questions regarding ticket renewal and alternative Slurm AuthType

3XX0 commented 2 years ago

Hi,

We're in the process of evaluating AUKS for Kerberized deployments and I had a few questions:

From my understanding, auksdrenewer is responsible for renewing the tickets in auksd, while the SPANK plugin process renews those on each compute node (with auks -R loop). What happens to long-running jobs when the ticket expires and is no longer renewable? Is there a way to update the ticket ahead of time if the user got a fresh one? If so, does the user have to do this manually with the auks API, or can it be automated somehow (something similar to GSS rekeying with PAM, maybe)? Does it matter whether the job is running or not, and is there any race we need to watch for?
Has there been any effort towards replacing Munge with a Kerberos-based approach as the Slurm AuthType? It doesn't look like this project is addressing this, but I guess most of its infrastructure could be reused for it.

hautreux commented 2 years ago

Hi, here are some answers concerning your first point:

Your understanding is correct. auksdrenewer is in charge of periodically renewing all the still-renewable TGTs pushed to and stored in auksd. This ensures that jobs can start with a valid TGT even when they stay pending longer than the ticket's initial lifetime.
To deal with long-running jobs, that is to say, jobs that could run longer than the initial TGT renewable time, the auks -R loop helper tasks (started by the SPANK plugin on the compute nodes) renew their TGTs first from auksd through the auks API, with a fallback renewal mechanism using the KDC. This allows jobs to pick up new TGTs that their users add to auksd while the jobs are running. As long as users push TGTs to auksd (whether by submitting new jobs or by calling auks -a) before the end of the renewable time of the TGT stored in auksd, everything will be fine. Otherwise, jobs will experience I/O errors due to the lack of a valid Kerberos credential.
There is no equivalent of the automatic GSS rekeying to automatically push tickets to auksd. A pull request is still pending to add a PAM module for auks (https://github.com/hautreux/auks/pull/22). It could certainly be used to do that. I never integrated it, but that could be feasible; let me know if you give it a try. If there are multiple requests for that, I could include it.
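
The renewal behaviour described above can be illustrated with a toy timeline model. This is only a sketch, not AUKS code: the names `Tgt`, `renew_via_kdc` and `simulate` are hypothetical, and the 10-hour lifetime, 168-hour renewable time and hourly renewal check are assumed values chosen for the example. It shows a job losing its credential once the renewable time runs out, unless the user pushes a fresh TGT to auksd beforehand.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Tgt:
    expires_at: int   # hour at which the current ticket lifetime ends
    renew_until: int  # hour at which the renewable time ends


def renew_via_kdc(tgt: Tgt, now: int, lifetime: int) -> Optional[Tgt]:
    """Model a KDC renewal: only a still-valid ticket can be renewed,
    and its lifetime is never extended past renew_until."""
    if now >= tgt.expires_at or now >= tgt.renew_until:
        return None
    return Tgt(min(now + lifetime, tgt.renew_until), tgt.renew_until)


def simulate(job_hours: int, pushes: Set[int],
             lifetime: int = 10, renewable: int = 168) -> Optional[int]:
    """Return the hour at which the job loses its credential, or None.

    `pushes` holds the hours at which the user pushes a fresh TGT
    (for example with `auks -a`). All durations are assumed values.
    """
    stored = Tgt(lifetime, renewable)  # TGT pushed at submission, hour 0
    local = stored                     # copy used by the job on the node
    for now in range(1, job_hours + 1):
        if now in pushes:
            # The user adds a brand-new TGT to the central store.
            stored = Tgt(now + lifetime, now + renewable)
        else:
            # Central renewer keeps the stored TGT alive while it can.
            stored = renew_via_kdc(stored, now, lifetime) or stored
        if stored.expires_at > now:
            # Node helper: take the ticket from the central store first.
            local = stored
        else:
            # Fallback: try to renew the local ticket against the KDC.
            local = renew_via_kdc(local, now, lifetime) or local
        if local.expires_at <= now:
            return now  # no valid credential left: I/O errors start here
    return None


if __name__ == "__main__":
    print(simulate(240, set()))   # no push: fails at hour 168
    print(simulate(240, {120}))   # push at hour 120: None (job survives)
    print(simulate(240, {200}))   # push arrives too late: fails at hour 168
```

In this model a push only helps if it happens before the stored TGT reaches the end of its renewable time, which matches the condition stated above.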

Concerning replacing Munge with a Kerberos-based approach as the Slurm AuthType, I would say that it is more a Slurm-related feature than an auks one. This should be discussed with the Slurm developers. But for sure, this is something of interest. I worked with an intern on a prototype of kerberized RPCs for Slurm about 10 years ago, but it was unfortunately not as simple as creating a new AuthType plugin :(. The auth API of Slurm had to be modified and we never went further than a first roughly working proof of concept (using the GSSAPI, not the Auks internals). I am not even sure that I still have the code/patch, but if you are interested in working on that, I could do some digging and try to find it again.

Matthieu

3XX0 commented 2 years ago

Thank you for the detailed explanation, this is pretty much what I expected. The PAM plugin PR is exactly what I was looking for (somehow I missed it). I will play with it and report back on how it goes.

Regarding Munge, I agree this is more of a Slurm issue, and it's good to know that you've looked into it before. I might look into it once we've got everything set up. Don't worry too much about digging up the code, as it may take a while :)

3XX0 commented 2 years ago

Reopening since I have an additional question:

I've had time to experiment a little, and I was wondering if there is any reason why the SPANK logic is done in init, user_init and task_exit rather than in job_prolog and job_epilog.

There are cases where multiple job steps can be running simultaneously (e.g. salloc with use_interactive_step). In those cases, a separate credential is created for each step, along with its own auks loop.

So why is AUKS operating at the job step level rather than the job level?

3XX0 commented 1 year ago

"The auth API of Slurm had to be modified and we never went further than a first roughly working proof of concept (using the GSSAPI, not the Auks internals). I am not even sure that I still have the code/patch, but if you are interested on working on that, I could do some digging and try to find that again."

@hautreux if it's not too much to ask, I would be very interested if you could find it. I've talked to SchedMD about this, and they are interested in seeing a PoC to see what can be done upstream.

hautreux commented 1 year ago

I am sorry, but I am no longer in a position to access that, and the last time I checked I did not find the code.

Questions regarding ticket renewal and alternative Slurm AuthType #66