Open 3XX0 opened 2 years ago
Hi, here are some answers concerning your first point:
auksdrenewer is in charge of periodically renewing all the still-renewable TGTs pushed to and stored in auksd. This ensures that jobs can start with a valid TGT even when they stay pending longer than the ticket's initial lifetime.
The auks -R loop helper tasks (started by the SPANK plugin on the compute nodes) renew their TGTs first using auksd through the auks API, with a fallback renewal mechanism using the KDC. This allows jobs to leverage new TGTs added to auksd by the associated users while the jobs are running.
As long as users are pushing TGTs to auksd (whether by submitting new jobs or by calling auks -a) before the end of the renewable time of their TGT stored in auksd, everything will be fine. Otherwise, jobs will experience I/O errors due to the lack of a valid Kerberos credential.
Concerning replacing Munge with a Kerberos-based approach as the Slurm AuthType, I would say that it is more of a Slurm-related feature than an auks one. This should be discussed with the Slurm developers. But for sure, this is something of interest. I worked with an intern on a prototype of kerberized RPCs for Slurm about 10 years ago, but it was unfortunately not as simple as creating a new AuthType plugin :(. The auth API of Slurm had to be modified and we never went further than a first, roughly working proof of concept (using the GSSAPI, not the Auks internals). I am not even sure that I still have the code/patch, but if you are interested in working on that, I could do some digging and try to find it again.
Matthieu
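The renewal order described in the reply above (auksd first, then the KDC as a fallback) can be sketched roughly as follows. This is only an illustration of the idea, not the actual auks code: the `Ticket` class, the `renew_once` function, and the two fetcher callables are hypothetical names introduced here, and a real implementation would go through the auks API and Kerberos libraries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class Ticket:
    """Hypothetical stand-in for a TGT: only the two timestamps that matter here."""
    expires_at: datetime   # end of the current ticket lifetime
    renew_until: datetime  # end of the renewable lifetime


# A fetcher returns a fresh Ticket, or None if it cannot provide one.
Fetcher = Callable[[], Optional[Ticket]]


def renew_once(current: Ticket, from_auksd: Fetcher, from_kdc: Fetcher,
               now: datetime) -> Optional[Ticket]:
    """Try auksd first, then the KDC; return None when no valid ticket is left."""
    for fetch in (from_auksd, from_kdc):
        try:
            candidate = fetch()
        except OSError:
            # Assumption: an unreachable service surfaces as an OSError.
            candidate = None
        if candidate is not None and candidate.expires_at > now:
            return candidate
    # Neither source helped: keep the current ticket only while it is still valid.
    return current if current.expires_at > now else None
```

Because auksd is asked first, a TGT that the user pushed more recently (with a later renewable deadline) is picked up by running jobs even though the ticket they started with could not be renewed that far.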
Thank you for the detailed explanation, this is pretty much what I expected. The PAM plugin PR is exactly what I was looking for (somehow I missed it), I will play with it and report back on how it goes.
Regarding Munge, I agree this is more of a Slurm issue and it's good to know that you've looked into it before. I might look into it once we've got everything set up. Don't worry too much about digging up the code as it may take a while :)
Reopening since I have an additional question:
I've had time to experiment a little and I was wondering if there is any reason why the SPANK logic is done in init, user_init and task_exit rather than job_prolog and job_epilog?
There are cases where multiple job steps can be running simultaneously (e.g. salloc with use_interactive_step). In those cases, there will be multiple unique credentials created and an auks loop for each one of them.
So why is AUKS operating at the job step level rather than the job level?
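To make the difference in the question concrete, here is a small sketch that counts how many credential caches and renewal loops would exist on one node under each scheme. It is a toy model only: `renewal_loops` and the `(job_id, step_id)` tuples are hypothetical, and the `per_job=True` branch represents the alternative being asked about, not existing AUKS behaviour.

```python
from typing import Iterable, Set, Tuple

Step = Tuple[int, int]  # (job_id, step_id), hypothetical identifiers


def renewal_loops(running_steps: Iterable[Step], per_job: bool) -> Set:
    """Return the keys that would each own one credential and one renew loop."""
    if per_job:
        # Hypothetical job-level scheme: one credential shared by all steps of a job.
        return {job_id for job_id, _ in running_steps}
    # Step-level scheme described in the question: one credential per job step.
    return set(running_steps)


steps = [(42, 0), (42, 1), (42, 2)]  # three concurrent steps of job 42
print(len(renewal_loops(steps, per_job=False)))  # one loop per step
print(len(renewal_loops(steps, per_job=True)))   # one loop per job
```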
The auth API of Slurm had to be modified and we never went further than a first, roughly working proof of concept (using the GSSAPI, not the Auks internals). I am not even sure that I still have the code/patch, but if you are interested in working on that, I could do some digging and try to find it again.
@hautreux if it's not too much to ask, I would be very interested if you could find it. I've talked to SchedMD about this and they are interested in seeing a PoC to see what can be done upstream.
I am sorry, but I am no longer in a position to access that, and the last time I checked I did not find the code.
Hi,
We're in the process of evaluating AUKS for Kerberized deployments and I have a few questions:
From my understanding, auksdrenewer is responsible for renewing the tickets in auksd, while the SPANK plugin process will renew those on each compute node (with auks -R loop). What happens when the ticket expires and is not renewable for long-running jobs? Is there a way to update the ticket ahead of time if the user got a fresh one? If so, does the user have to do this manually with the auks API, or can it be automated somehow (something similar to GSS rekeying with PAM, maybe)? Does it matter whether the job is running or not, and is there any race we need to watch for?
Has there been any effort towards replacing Munge with a Kerberos-based approach as the Slurm AuthType? It doesn't look like this project is addressing this, but I guess most of its infrastructure could be reused for it.
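The long-running-job scenario in the first question can be made concrete with a little date arithmetic. The sketch below assumes, as stated elsewhere in this thread, that a job keeps a valid credential as long as a fresh TGT is pushed to auksd before the renewable lifetime of the stored one ends. The function name `covered_until`, the fixed renewable lifetime per push, and the assumption that renewals inside the window always succeed are simplifications introduced here.

```python
from datetime import datetime, timedelta
from typing import Iterable


def covered_until(push_times: Iterable[datetime], renewable: timedelta) -> datetime:
    """Return the instant at which the chain of renewable windows first breaks.

    push_times: when the user pushed a fresh TGT to auksd.
    renewable:  renewable lifetime of each pushed TGT (assumed identical).
    """
    times = sorted(push_times)
    if not times:
        raise ValueError("at least one TGT must have been pushed")
    end = times[0] + renewable
    for pushed in times[1:]:
        if pushed >= end:
            # Pushed too late: the stored TGT could no longer be renewed,
            # so the job would already have hit credential errors.
            break
        end = pushed + renewable
    return end


start = datetime(2024, 1, 1)
week = timedelta(days=7)
job_end = start + timedelta(days=10)

# A single push with a 7-day renewable lifetime cannot cover a 10-day job...
print(covered_until([start], week) >= job_end)
# ...but a second push on day 5 extends coverage to day 12.
print(covered_until([start, start + timedelta(days=5)], week) >= job_end)
```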