hautreux / auks

Kerberos credential support for batch environments

Is there a consensus on which commit to use for RHEL7 with Kerberized NFS? #56

Closed whamblen closed 2 years ago

whamblen commented 3 years ago

We spent some time this week trying to deploy Slurm 20.11.3 with auks 0.5.0 on our CentOS 7 cluster. It seems clear that we want a release that is in between 0.4.4 and 0.5.0 and I'm hoping somebody knows where the sweet spot is!

0.5.0 doesn't work with Kerberized NFS because auks creates caches of the form /tmp/tkt and NFS only looks for /tmp/krb5cc_uid. See issue #43. This took embarrassingly long to discover for ourselves. :-)
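A quick way to check a node for this is to test whether a cache with the name NFS expects is actually present for a given user. The sketch below is a hypothetical helper, not part of auks or Slurm. It assumes caches live in /tmp and takes the `krb5cc_<uid>` naming from the description above; the function names are made up for illustration.

```python
import os


def expected_nfs_ccache(uid, cache_dir="/tmp"):
    """Build the krb5cc_<uid> path that, per this issue, NFS looks for."""
    return os.path.join(cache_dir, f"krb5cc_{uid}")


def has_nfs_visible_ccache(uid, cache_dir="/tmp"):
    """Return True if the expected cache file exists and is owned by uid."""
    path = expected_nfs_ccache(uid, cache_dir)
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_uid == uid


if __name__ == "__main__":
    me = os.getuid()
    print(expected_nfs_ccache(me), has_nfs_visible_ccache(me))
```

Running this inside a job step as the job's user should show whether the credential auks forwarded ended up under the name NFS will pick up.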

We are also experiencing the issue described in #23 where gssproxy and auks fight over root's ticket cache (RHEL7.4 and newer). The workaround described there looks perfectly fine for 0.4.4 but the relevant environment variable seems to have been removed from 0.5.0. Disabling gssproxy isn't an attractive option for us (but I won't rule it out) so I'd like to avoid a version where AUKS_PRIV_CCACHE_APPEND has been removed.

Unfortunately 0.4.4 apparently doesn't terminate jobs when running newer versions of Slurm. See issue #24 (which has a commit that purports to fix it). We haven't actually tested this ourselves but obviously we'd want that patch.

Does anyone who has already gone down this road have advice? @kenshin33 @trenta

kenshin33 commented 3 years ago

I don't know about gssproxy (not there yet), but for the NFS thing I took the latest version tag and patched it (I put a link in one of my comments in #43). It seems to work (crude and ugly, but it works). I plan to sit down and do it correctly, but I don't have the time for now.

whamblen commented 3 years ago

Thanks for the reply kenshin33! I haven't tried your patch from #43 yet. To be honest, I was a little leery when you yourself questioned the sanity/safety of it. :-) Since we are brand new to Slurm/auks (currently running moab/torque), I was going for the option with the fewest potential surprises first so we could start submitting jobs and seeing what's different.

Are you currently using that patch even if it's ugly? I've run out of time for this project till next week, but I can definitely try it in our environment if that would be useful.

trenta commented 3 years ago

Again, not a pretty solution, but I forked this and added patch hautreux#24 to the 4.4 branch because I was seeing the issue where jobs weren't terminating. Because it's been a while since I got it working, feel free to check out the result at https://github.com/trenta/auks/commits/0.4.5 (I have it working with Slurm 19 and RHEL 7.9). I also had the ticket caching issue, and to sort that out I added the following to the [libdefaults] section of /etc/krb5.conf:

ccache_type = 4
default_ccache_name = FILE:/tmp/krb5cc_%{uid}

whamblen commented 3 years ago

Thanks for the link trenta! I'll take a look at that when I get back to this.

I forgot to mention when replying to kenshin33 that I did get a functioning auks based off 0.4.4 with the #24 patch applied and the AUKS_PRIV_CCACHE_APPEND workaround for gssproxy. I'm going to let people bang on that a little bit (since we are so new to Slurm) before changing anything else.

kenshin33 commented 3 years ago

> Again, not a pretty solution, but I forked this and added patch hautreux#24 to the 4.4 branch because I was seeing the issue where jobs weren't terminating. Because it's been a while since I got it working, feel free to check out the result at https://github.com/trenta/auks/commits/0.4.5 (I have it working with Slurm 19 and RHEL 7.9). I also had the ticket caching issue, and to sort that out I added the following to the [libdefaults] section of /etc/krb5.conf:
>
> ccache_type = 4
> default_ccache_name = FILE:/tmp/krb5cc_%{uid}

Wouldn't that force all applications to use the same file? For example: start a job, ssh into the node, exit the shell, and bye bye ticket. Or start a job, then start a second, shorter one that ends up on the same node, and bye bye ticket.

Perfectly working default behaviour was broken by the patch that introduced this issue.
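To make the concern concrete, here is a minimal sketch of the collision. It is a toy model, not how libkrb5 or auks is implemented: it only expands the `%{uid}` token of the template quoted above and uses a dict to stand in for files under /tmp. The uids, function name, and job names are hypothetical.

```python
TEMPLATE = "FILE:/tmp/krb5cc_%{uid}"


def expand_ccache_name(template, uid):
    """Expand only the %{uid} token of a default_ccache_name template."""
    return template.replace("%{uid}", str(uid))


# Two hypothetical jobs from the same user on the same node, plus another user.
job_a = expand_ccache_name(TEMPLATE, 1000)
job_b = expand_ccache_name(TEMPLATE, 1000)
other_user = expand_ccache_name(TEMPLATE, 1001)

print(job_a)                # FILE:/tmp/krb5cc_1000
print(job_a == job_b)       # True: both jobs resolve to the same file
print(job_a == other_user)  # False: a different uid gets a different file

# Toy stand-in for the filesystem: cache name -> ticket.
caches = {}
caches[job_a] = "ticket"    # job A starts and stores its credential
caches[job_b] = "ticket"    # job B starts and writes to the same name
caches.pop(job_b)           # job B ends first and its cache is destroyed
print(job_a in caches)      # False: job A has lost its ticket as well
```

Any per-user name has this property; separate names per job or session would be needed to avoid it.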

KoenDierckx commented 3 years ago

We are having the same issue.

Currently trying to fix this in 0.5.0 with the patch from @kenshin33 mentioned here: https://github.com/hautreux/auks/issues/43#issuecomment-720134187

I did have to recreate this patch, as it wouldn't apply at first.

It would be great if @hautreux could have a look at this patch!

fihuer commented 3 years ago

Please check with the krb5-libs patch pointed to in #43.

Cheers,

hautreux commented 2 years ago

Please consider using the new 0.5.3 version. It provides a new option to revert to a file-backed ccache in the auks spank plugin (see 53347ab9a10246f2b9108039343887d1ae3adef8).