hautreux / auks

Kerberos credential support for batch environments

Another how to question. not all kerberos principals showing and no access to NFS shares #43

Closed trenta closed 2 years ago

trenta commented 4 years ago

Hello again,

I now have auks working with slurm.

[trenttesttwo@slurm-login01 tmp]$ auks -p
Auks API request succeed

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tktrJUXoi
Default principal: trenttesttwo@AD.SVI.EDU.AU

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/AD.SVI.EDU.AU@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ auks -r
Auks API request succeed

[trenttesttwo@slurm-login01 tmp]$ srun --auks=no klist -a
klist: No credentials cache found (filename: /tmp/krb5cc_1430606966_3dvoPL)
srun: error: scn01: task 0: Exited with exit code 1

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tktdgFelH
Default principal: trenttesttwo@AD.SVI.EDU.AU

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/AD.SVI.EDU.AU@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)

I have NFSv4 shares mounted with sec=krb5 and use them successfully in other locations in the business. The shares are mounted correctly, and a user logged in directly on a compute node can access them. However, accessing them via slurm isn't working. If I su to a user on a compute node and run kinit, a slurm command run afterwards by that user does have access to the share.

User: trenttesttwo on login node

[trenttesttwo@slurm-login01 tmp]$ klist -a
Ticket cache: FILE:/tmp/krb5cc_1430606966_3dvoPL
Default principal: trenttesttwo@AD.SVI.EDU.AU

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/AD.SVI.EDU.AU@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:24  03/07/20 21:43:24  host/slurm-cont01.ad.svi.edu.au@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/files08.svi.edu.au@
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/files08.svi.edu.au@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/files12.svi.edu.au@
        renew until 01/10/20 11:43:12
        Addresses: (none)
03/07/20 11:43:29  03/07/20 21:43:29  nfs/files12.svi.edu.au@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tkt8dzGl2
Default principal: trenttesttwo@AD.SVI.EDU.AU

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/AD.SVI.EDU.AU@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ ls -l /mnt
total 22
drwxrws---. 17 root        50047 19 Mar 11 16:53 mannbiofiles
drwxrws---. 12 root mccart_files 14 Jun 15 15:35 mcfiles
drwxrws---. 13 root mccart_files 17 Feb 10 11:39 mcscratch

[trenttesttwo@slurm-login01 tmp]$ srun ls -l /mnt
/usr/bin/ls: cannot access /mnt/mcfiles: Permission denied
/usr/bin/ls: cannot access /mnt/mcscratch: Permission denied
total 0
d????????? ? ? ? ?            ? mannbiofiles
d????????? ? ? ? ?            ? mcfiles
d????????? ? ? ? ?            ? mcscratch
/usr/bin/ls: cannot access /mnt/mannbiofiles: Permission denied
srun: error: scn01: task 0: Exited with exit code 1

Same user directly on that compute node: This is just to show that the compute node has the right access setup.

[trenttesttwo@scn01 ~]$ ls -l /mnt
total 22
drwxrws---. 17 root        50047 19 Mar 11 16:53 mannbiofiles
drwxrws---. 12 root mccart_files 14 Jun 15 15:35 mcfiles
drwxrws---. 13 root mccart_files 17 Feb 10 11:39 mcscratch

And now again from the login node

[trenttesttwo@slurm-login01 tmp]$ srun klist -a
Ticket cache: FILE:/tmp/tktpppRvy
Default principal: trenttesttwo@AD.SVI.EDU.AU

Valid starting     Expires            Service principal
03/07/20 11:43:12  04/07/20 11:43:12  krbtgt/AD.SVI.EDU.AU@AD.SVI.EDU.AU
        renew until 01/10/20 11:43:12
        Addresses: (none)

[trenttesttwo@slurm-login01 tmp]$ srun ls -l /mnt
total 22
drwxrws---. 17 root        50047 19 Mar 11 16:53 mannbiofiles
drwxrws---. 12 root mccart_files 14 Jun 15 15:35 mcfiles
drwxrws---. 13 root mccart_files 17 Feb 10 11:39 mcscratch

Grateful for any help.

Cheers

Trent

hautreux commented 4 years ago

I am wondering how you end up with a Kerberos credential cache using file patterns like '/tmp/tkt...'. Could you explain your configuration? The standard naming is something like '/tmp/krb5cc_%uid_%random'. Depending on your NFS client configuration for Kerberos, this may be the origin of the issue. The way rpc.gssd used to work, it iterated over the files matching a pattern like /tmp/krb5cc_%uid* to grab the Kerberos material needed to access the kerberized file system. HTH

trenta commented 4 years ago

I think you're right about the issue. I've been poking around and still haven't found why I'm getting the format '/tmp/tkt...' when using auks.

On the login node as a normal user:

-bash-4.2$ klist
Ticket cache: FILE:/tmp/krb5cc_1430606678
Default principal: trenttest@AD.SVI.EDU.AU

-bash-4.2$ srun --auks=no klist
klist: No credentials cache found (filename: /tmp/krb5cc_1430606678)
srun: error: test-scn01: task 0: Exited with exit code 1

-bash-4.2$ srun klist
Ticket cache: FILE:/tmp/tkt9QtN56
Default principal: trenttest@AD.SVI.EDU.AU

Valid starting     Expires            Service principal
11/07/20 13:25:50  12/07/20 13:25:46  krbtgt/AD.SVI.EDU.AU@AD.SVI.EDU.AU
        renew until 09/10/20 14:25:46

-bash-4.2$ srun echo $KRB5CCNAME
FILE:/tmp/krb5cc_1430606678

I'll keep looking. I might also try and change everything over to use keyring unless you would advise otherwise.

Thanks

Trent

trenta commented 4 years ago

Changed to keyring and have a similar issue.

Login node:

klist
Ticket cache: KEYRING:persistent:1430606678:krb_ccache_CCEk08o

Compute node locally:

klist
Ticket cache: KEYRING:persistent:1430606678:1430606678

Note that the final part is the UID on the compute node, but krb_ccache_CCEk08o on the login node.

The krb5.conf file is the same on both.

trenta commented 4 years ago

Gone back to FILE for the ticket cache.

Is the ticket cache file something that is specified in the AUKS config anywhere?

Because on all nodes including login and master as a normal user I get the following:

kinit -V
Using default cache: /tmp/krb5cc_1430606678

It is only when using srun to show klist that it shows the following:

srun klist
Ticket cache: FILE:/tmp/tktJXW6SI

And I'm not sure it's relevant but from my reading it looks like that's a krb4 ticket cache format.

trenta commented 4 years ago

I reverted to AUKS 0.4.4, which fixes the ticket cache issue, but now an srun job succeeds and then never exits or completes.

When I run a job without AUKS it exits or completes fine.

trenta commented 4 years ago

Ah the problem I'm seeing seems to be #24

trenta commented 4 years ago

I applied that patch to my fork of 0.4.4 and it is all working. So my assumption is that the ticket cache file naming in 0.5.0 is an auks issue and not caused by the way I set up Kerberos. I really didn't think that would be the case. Since I've already done quite a bit of testing, please let me know if you would like me to do any more to help with this issue, or if you still feel it is probably my configuration.

Cheers

Trent

bcchrisupp commented 4 years ago

Hey @trenta, what OS and OS release are you using? I'm having similar issues with CentOS 8 when using auks 0.5.0 or 0.4.4 (albeit with the changes from this commit and the patch from issue 24).

trenta commented 4 years ago

Currently RHEL 7.

bcchrisupp commented 4 years ago

Thanks for posting that, @trenta. It looks like my issues are likely due to CentOS 8, as I too can get auks working on RHEL 7.

hautreux commented 4 years ago

Patches were introduced in 0.5.0 to make the KCM credential cache work. It seems that some regressions appeared because of that. You should try activating KCM on CentOS 8 to see if it works better. We have some RHEL 8 systems with KCM enabled and auks-0.5.0 that work properly. Reverting to 0.4.4 plus the mentioned patches is probably the best option on RHEL 7 for now. I will need to look at this and do some testing to understand what happens and fix it. I will not be able to do that before September.

bcchrisupp commented 4 years ago

@hautreux thanks for the suggestion, I've switched my machines back to using KCM but am still not able to get things working. I won't muddy up this issue with my problems any further, but I'll post further findings in my issue in case you can see something obvious that I've missed.

fihuer commented 3 years ago

> Is the ticket cache file something that is specified in the AUKS config anywhere?

Before 0.5.0, the ccache name was always something like /tmp/krb5cc_%uid_%jobid_%random. Beginning with 0.5, auks uses libkrb5's default ccache (krb5_cc_new_unique), which uses the one configured in krb5.conf and other related config files.
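
For illustration only, a file name of the pre-0.5.0 shape described here could be produced roughly as follows. This is a sketch and not the auks source: the helper name make_legacy_ccache_path and the use of mkstemp() for the random suffix are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from auks): create an empty file named
 * /tmp/krb5cc_<uid>_<jobid>_<random> and write its path into out.
 * Returns 0 on success, -1 on failure. */
int make_legacy_ccache_path(unsigned long uid, unsigned long jobid,
                            char *out, size_t outlen)
{
    assert(out != NULL);

    /* Build the mkstemp() template; XXXXXX is replaced by mkstemp(). */
    int n = snprintf(out, outlen, "/tmp/krb5cc_%lu_%lu_XXXXXX", uid, jobid);
    if (n < 0 || (size_t)n >= outlen)
        return -1;              /* buffer too small for the template */

    int fd = mkstemp(out);      /* creates the file and fills in the suffix */
    if (fd == -1)
        return -1;
    close(fd);
    return 0;
}
```

A name built this way contains "krb5cc_<uid>", which is the part a pattern such as /tmp/krb5cc_%uid* would match.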

The /tmp/tkt%random sounds like a weird default written somewhere. Maybe sssd?

Could you give us your krb5.conf (and conf.d friends) and the sssd.conf?

kenshin33 commented 3 years ago

Renaming the ticket cache to krb5cc_something_something and adjusting KRB5CCNAME within the job works like a charm, where previously the home directory wouldn't mount even after getting a service ticket for the NFS server (kvno nfs/fqdn@REALM).

The problem seems to be nfs-utils: it specifically looks for anything that starts with krb5cc and ignores anything else ... or so it seems. In nfs-utils-2.5.2/utils/gssd/krb5_util.c:

    static int
    select_krb5_ccache(const struct dirent *d)
    {
        /*
         * Note: We used to check d->d_type for DT_REG here,
         * but apparenlty reiser4 always has DT_UNKNOWN.
         * Check for IS_REG after stat() call instead.
         */
        if (strstr(d->d_name, GSSD_DEFAULT_CRED_PREFIX))
            return 1;
        else
            return 0;
    }

which gets called by scandir, which in turn gets called by gssd_find_existing_krb5_ccache and ultimately by gssd_setup_krb5_user_gss_ccache (all of these functions are in the same file as the function above).
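
To make the effect of that filter concrete, here is a small self-contained sketch that applies the same strstr() test to cache names seen in this thread. It assumes GSSD_DEFAULT_CRED_PREFIX is "krb5cc", as the description above suggests; name_passes_gssd_filter is a made-up name, and the checks gssd performs after the filter (the quoted comment mentions a stat() call) are not modelled.

```c
#include <assert.h>
#include <string.h>

/* Assumed value, based on the description in this thread; check
 * krb5_util.h in your nfs-utils version. */
#define GSSD_DEFAULT_CRED_PREFIX "krb5cc"

/* Hypothetical stand-in for select_krb5_ccache(), taking the file name
 * directly instead of a struct dirent. Returns 1 if the name would be
 * kept by the filter, 0 otherwise. */
int name_passes_gssd_filter(const char *name)
{
    assert(name != NULL);
    return strstr(name, GSSD_DEFAULT_CRED_PREFIX) != NULL;
}
```

Under that assumption, a name like krb5cc_1430606966_3dvoPL is kept while tktrJUXoi is skipped. Note that strstr() matches the prefix anywhere in the name, not only at the start.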

A workaround that I think might work is renaming the ticket cache and redefining KRB5CCNAME in a TaskProlog script (they run last before the task, if I'm not mistaken)? Obviously the above does not really work ... as the renewer process has already been launched and there's no way to change KRB5CCNAME in its env (unless it is killed and restarted ...).

fihuer commented 3 years ago

This method to acquire credentials is a fallback to the libkrb5's traditional method. See nfs-utils's gss_proc.c, which calls libkrb5's gss_acquire_cred:

    maj_stat = gss_acquire_cred(&min_stat, GSS_C_NO_NAME, GSS_C_INDEFINITE,
                    &desired_mechs, GSS_C_INITIATE,
                    gss_cred, NULL, NULL);

As such, rpc-gssd will act as a user without any KRB5CCNAME: it tries to find a ccache in the default location (defined in krb5.conf) or in "traditional" locations (i.e. /tmp or /run/$USER, with some filters as you said).

Relying on rpc-gssd's fallback method to acquire credentials does not seem quite safe to me.

What kind of default_ccache_name do you have in your krb5.conf ?

kenshin33 commented 3 years ago

I define none, which by default should yield FILE:/tmp/krb5cc_%{uid} (according to the Kerberos documentation). I have in no way, shape or form instructed rpc-gssd to do anything (I can't see how I would), and was assuming that it was using whatever mechanisms the Kerberos library offers (chief among them the KRB5CCNAME environment variable).

Well, of course gssd has no idea what KRB5CCNAME is pointing to ... it is running in its own little place, with its own little env. It tries the default thing (FILE:/tmp/krb5cc_%{uid}); if nothing's there it falls back to trying files ... It was working all these years b/c it so happens that pam uses krb5cc_%UID_XXXXXX.
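
For reference, the default name mentioned here comes from expanding the %{uid} token in a default_ccache_name-style template. The following is a minimal sketch of that expansion only; it is not libkrb5's code (which handles other tokens as well), and expand_uid_token is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy tmpl into out, replacing the first "%{uid}"
 * with the decimal uid. Returns 0 on success, -1 if out is too small. */
int expand_uid_token(const char *tmpl, unsigned long uid,
                     char *out, size_t outlen)
{
    assert(tmpl != NULL && out != NULL);

    const char *tok = strstr(tmpl, "%{uid}");
    int n;
    if (tok == NULL)
        n = snprintf(out, outlen, "%s", tmpl);  /* nothing to expand */
    else
        n = snprintf(out, outlen, "%.*s%lu%s",
                     (int)(tok - tmpl), tmpl,   /* text before the token */
                     uid,                       /* the expanded uid */
                     tok + strlen("%{uid}"));   /* text after the token */
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For uid 1000, the template FILE:/tmp/krb5cc_%{uid} becomes FILE:/tmp/krb5cc_1000, a name that contains the krb5cc prefix, unlike a /tmp/tktXXXXXX file.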

With that said, how would setting a default in krb5.conf change anything? krb5_cc_new_unique seems to completely ignore the default (at least in the FILE: case; I didn't check the other types as I can't really use them for now).

I wrote a tiny C program that tries to do what the plugin does. Without setting a default:

 ./krb5 
ccache_type = FILE
default ccache name = FILE:/tmp/krb5cc_1000
ccache_name = FILE:/tmp/tktKg5Q9U

with a default set (FILE:/tmp/somthing_%{uid})

./krb5 
ccache_type = FILE
default ccache name = FILE:/tmp/somthing_1000
ccache_name = FILE:/tmp/tktgpXEBE

with a default set (KEYRING)

ccache_type = KEYRING
default ccache name = KEYRING:session:keys
ccache_name = KEYRING:session:keys:krb_ccache_lMmeJb1

I'll give the above a try (but as I said, I can't use keyring b/c of some old legacy stuff, a Pandora's box).

IMHO auks should revert to using krb5cc_ if the cache type is FILE

Code from mit-krb5-1.18's krb5_cc_new_unique (well, what ends up being called in the case of FILE: fcc_generate_new):

    static krb5_error_code KRB5_CALLCONV
    fcc_generate_new(krb5_context context, krb5_ccache *id)
    {
        char scratch[sizeof(TKT_ROOT) + 7]; /* Room for XXXXXX and terminator */

        (void)snprintf(scratch, sizeof(scratch), "%sXXXXXX", TKT_ROOT);
        return krb5int_fcc_new_unique(context, scratch, id);
    }

scratch above is the template that gets passed to mkstemp in krb5int_fcc_new_unique, and it is what ends up as the filename for the ticket cache; no parameter in krb5.conf will change that.
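
The naming effect is easy to reproduce outside libkrb5. The sketch below mimics only the template step, assuming TKT_ROOT expands to "/tmp/tkt" (consistent with the /tmp/tktXXXXXX names seen in this thread); new_unique_file_name is a hypothetical name and no Kerberos data is written to the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TKT_ROOT "/tmp/tkt"   /* assumed value, see the text above */

/* Hypothetical helper mirroring the template logic of fcc_generate_new():
 * builds "<TKT_ROOT>XXXXXX", lets mkstemp() fill in the suffix and create
 * the file, then copies the resulting path into out.
 * Returns 0 on success, -1 on failure. */
int new_unique_file_name(char *out, size_t outlen)
{
    char scratch[sizeof(TKT_ROOT) + 7]; /* room for XXXXXX and terminator */

    assert(out != NULL);
    (void)snprintf(scratch, sizeof(scratch), "%sXXXXXX", TKT_ROOT);

    int fd = mkstemp(scratch);          /* no configuration is consulted */
    if (fd == -1)
        return -1;
    close(fd);

    if (strlen(scratch) + 1 > outlen) { /* caller's buffer is too small */
        unlink(scratch);
        return -1;
    }
    strcpy(out, scratch);
    return 0;
}
```

Every name produced this way starts with /tmp/tkt followed by six generated characters, whatever the configured default cache name is.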

    /* Generate a unique file ccache using the given template (which will be
     * modified to contain the actual name of the file). */
    krb5_error_code
    krb5int_fcc_new_unique(krb5_context context, char *template, krb5_ccache *id)
    {
        krb5_ccache lid;
        int fd;
        krb5_error_code ret;
        fcc_data *data;
        char fcc_fvno[2];
        int16_t fcc_flen = 0;
        int errsave, cnt;

        fd = mkstemp(template);
        if (fd == -1)
            return interpret_errno(context, errno);
        set_cloexec_fd(fd);

        /* Allocate memory */
        data = malloc(sizeof(fcc_data));
        if (data == NULL) {
            close(fd);
            unlink(template);
            return KRB5_CC_NOMEM;
        }

        data->filename = strdup(template);
        if (data->filename == NULL) {
            free(data);
            close(fd);
            unlink(template);
            return KRB5_CC_NOMEM;
        }
        ...
    }

Patch (a crude one): https://gist.github.com/kenshin33/b406aa1a6668a87b5e14ff3fedc48ea3 It works; the question is: is it sane/safe?

fihuer commented 3 years ago

That's precisely where the bug seems to be.

From a quick look at the libkrb5 code, it seems that if there's no default in the configuration, krb5_cc_new_unique does not rely on DEFCCNAME but on the very old default /tmp/tktXXX. https://github.com/krb5/krb5/blob/881b5312f85216f27a2a2f2560edc4e81a0d939a/src/lib/krb5/ccache/cc_file.c#L932

If you want to use kcm or file ccache with auks and kerberized NFS mounts, you must define it in krb5.conf.

I'm pretty sure it's a libkrb5 bug.

Cheers,

kenshin33 commented 3 years ago

I don't see any decision making here: https://github.com/krb5/krb5/blob/881b5312f85216f27a2a2f2560edc4e81a0d939a/src/lib/krb5/ccache/cc_file.c#L932 Default or not, with a default cache type set (or left alone i.e ) to "FILE:".

The whole of krb5_cc_new_unique: https://github.com/krb5/krb5/blob/881b5312f85216f27a2a2f2560edc4e81a0d939a/src/lib/krb5/ccache/ccbase.c#L284 Ignoring TRACE_CC_NEW_UNIQUE, it basically gets the ops struct in order to call the new function of that type. In the case of FILE: (I didn't bother to check the rest of the types as I have no use for them right now), that is the fcc_generate_new above. As you can see, it has absolutely no decision making at all regarding the naming scheme: it passes the template "${TKT_ROOT}XXXXXX" to the next function, which takes it as it is and passes it to mkstemp (TKT_ROOT is #defined in the same file as '/tmp/tkt').

Unless I'm blind, setting a default (or not) has absolutely no bearing on the situation we're in! Or can you please point me in the direction of a "default" that will do the trick?

(Notice this: https://github.com/krb5/krb5/blob/881b5312f85216f27a2a2f2560edc4e81a0d939a/src/lib/krb5/ccache/ccbase.c#L293 as I think it is a source of a memory leak in auks right now.)

nktl commented 3 years ago

This is definitely still a problem with RHEL8. These /tmp/tk* tickets generated by AUKS are sadly pretty much useless, not just for NFS access, but for any kind of krb auth activity which relies on the ticket cache defined in krb5.conf (FILE:/tmp/krb5cc_%{uid}) - like SQL, Kafka, web auth, etc.

Thank you for the patch @kenshin33 - it seems to work well for us!

fihuer commented 3 years ago

Can you send us the default settings defined in krb5.conf ? With this we can try to reproduce it on a RHEL8 system.

nktl commented 3 years ago

Sure, the config is as follows:

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = DOMAIN.LOCAL
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 31d
 forwardable = true
 rdns = false
 default_ccache_name = FILE:/tmp/krb5cc_%{uid}

[realms]
 DOMAIN.LOCAL = {
 kdc = ad01.domain.local
 kdc = ad02.domain.local
 kdc = ad03.domain.local
 admin_server = lonpdc01.domain.local
}

[domain_realm]
 .domain.local = DOMAIN.LOCAL
 domain.local = DOMAIN.LOCAL

 [appdefaults]
 pam = {
  debug = false
  DOMAIN.LOCAL = {
   ignore_k5login = true
  }
 }

BTW, the vanilla config actually uses the kernel keyring rather than a FILE-based cache, but we do not use it - plenty of kerberized apps and 3rd-party libraries still do not support the keyring cache well (or at all), which causes a lot of problems, so we are falling back to the FILE setup.

trenta commented 3 years ago

For EL8 I couldn't get it working with FILE setup so I changed to KCM. I also applied this patch to 0.5.0 https://github.com/hautreux/auks/pull/53 and it is working well.

nktl commented 3 years ago

Good point, we had to apply patch #53 as well to get it to work with krb5-libs 1.18, introduced in RHEL 8.2 or so. Sadly KCM won't work for us, we need to use FILE.

fihuer commented 3 years ago

Hmkay, I've developed a patch on the krb5 libs that fixes the behavior of auks.

On EL8 systems, when PR#53 is applied on auks and PR#1211 on krb5-libs it should be okay.

To summarize:

Will check with @hautreux to make the master version of auks work in all cases.

Please try it out and tell us if it sounds OK to you. This should help in #56 too.

hautreux commented 2 years ago

Legacy file logic has been added in version 0.5.3, in commit 53347ab9a10246f2b9108039343887d1ae3adef8, using a dedicated spank option. Please consider trying it if you need FILE support for the ccache.