hautreux / auks

Kerberos credential support for batch environments
Other
21 stars 18 forks source link

spank-auks: unable to get user xxxxxxx cred : auks api : reply seems corrupted #27

Closed afont closed 6 years ago

afont commented 6 years ago

Hi all,

I'm trying to configure auks on centos 7.5 slurm cluster, but I'm facing this errors:

When running auks -a as root, it shows: 'No translation available for requested principal':

[root@pood006 alex]# klist Ticket cache: KEYRING:persistent:0:0 Default principal: host/pood006.int.34tech.io@INT.34TECH.IO

Valid starting Expires Service principal 08/15/2018 18:03:59 08/16/2018 18:03:06 host/pood006.int.34tech.io@INT.34TECH.IO 08/15/2018 18:03:06 08/16/2018 18:03:06 krbtgt/INT.34TECH.IO@INT.34TECH.IO

[root@pood006 alex]# /usr/local/bin/auks -avvvvvvvvv Wed Aug 15 18:24:39 2018 [INFO2] [euid=0,pid=25130] auks_engine: initializing engine from 'common' block of file /usr/local/etc/auks.conf Wed Aug 15 18:24:39 2018 [INFO2] [euid=0,pid=25130] auks_engine: initializing engine from 'api' block of file /usr/local/etc/auks.conf Wed Aug 15 18:24:39 2018 [INFO2] [euid=0,pid=25130] auks_engine: initializing engine from 'renewer' block of file /usr/local/etc/auks.conf Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon is 'pood006.int.34tech.io' Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon address is 'pood006.int.34tech.io' Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon port is 12345 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine primary daemon principal is host/pood006.int.34tech.io@INT.34TECH.IO Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon is 'localhost' Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon address is 'localhost' Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon port is 12345 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine secondary daemon principal is Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine logfile is /tmp/auksapi.log Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine loglevel is 9 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine debugfile is /tmp/auksapi.log Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine debuglevel is 9 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine retry number is 3 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine timeout is 10 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine delay is 3 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine NAT traversal mode is enabled Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_loglevel is 9 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer_debuglevel is 9 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer delay is 60 Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_engine: engine renewer min cred lifetime is 600 Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: kerberos context successfully initialized Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential cache successfully resolved Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential cache sequential read successfully started Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential cache sequential read successfully stopped Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: TGT found in credential cache Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: kerberos authentication context successfully initialized Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential successfully dumped into buffer Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_krb5_cred: credential successfully stored in output buffer Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: kerberos context successfully initialized Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: input buffer credential successfully unserialized Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: principal successfully unparse Wed Aug 15 18:24:39 2018 [INFO4] [euid=0,pid=25130] auks_cred: unable to get username from principal host/pood006.int.34tech.io@INT.34TECH.IO : No translation available for requested principal Wed Aug 15 18:24:39 2018 [INFO3] [euid=0,pid=25130] auks_api: auks cred extraction failed : auks cred : unable to convert principal to local name Auks API request failed : auks cred : unable to convert principal to local name

But when running on a normal user, it seems to work.

[alex@pood006 ~]$ klist Ticket cache: KEYRING:persistent:67200010:krb_ccache_rDbOwm0 Default principal: alex@INT.34TECH.IO

Valid starting Expires Service principal 08/15/2018 18:26:25 08/16/2018 18:20:50 host/pood006.int.34tech.io@INT.34TECH.IO 08/15/2018 18:20:50 08/16/2018 18:20:50 krbtgt/INT.34TECH.IO@INT.34TECH.IO

[alex@pood006 ~]$ auks -avvvvvvvvv Wed Aug 15 18:26:25 2018 [INFO2] [euid=67200010,pid=25262] auks_engine: initializing engine from 'common' block of file /usr/local/etc/auks.conf Wed Aug 15 18:26:25 2018 [INFO2] [euid=67200010,pid=25262] auks_engine: initializing engine from 'api' block of file /usr/local/etc/auks.conf Wed Aug 15 18:26:25 2018 [INFO2] [euid=67200010,pid=25262] auks_engine: initializing engine from 'renewer' block of file /usr/local/etc/auks.conf Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon is 'pood006.int.34tech.io' Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon address is 'pood006.int.34tech.io' Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon port is 12345 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine primary daemon principal is host/pood006.int.34tech.io@INT.34TECH.IO Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon is 'localhost' Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon address is 'localhost' Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon port is 12345 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine secondary daemon principal is Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine logfile is /tmp/auksapi.log Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine loglevel is 9 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine debugfile is /tmp/auksapi.log Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine debuglevel is 9 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine retry number is 3 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine timeout is 10 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine delay is 3 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine NAT traversal mode is enabled Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_loglevel is 9 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer_debuglevel is 9 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer delay is 60 Wed Aug 15 18:26:25 2018 [INFO3] [euid=67200010,pid=25262] auks_engine: engine renewer min cred lifetime is 600 Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: kerberos context successfully initialized Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential cache successfully resolved Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential cache sequential read successfully started Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential cache sequential read successfully stopped Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: TGT found in credential cache Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: kerberos authentication context successfully initialized Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential successfully dumped into buffer Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_cred: credential successfully stored in output buffer Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_cred: kerberos context successfully initialized Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_cred: input buffer credential successfully unserialized Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_cred: principal successfully unparse Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_api: starting retry 1 of 3 Wed Aug 15 18:26:25 2018 [INFO7] [euid=67200010,pid=25262] xstream: socket creation succeed Wed Aug 15 18:26:25 2018 [INFO7] [euid=67200010,pid=25262] xstream: socket non-blocking flag is now set Wed Aug 15 18:26:25 2018 [INFO7] [euid=67200010,pid=25262] xstream: connect (192.168.3.46:12345) succeed while polling Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_api: successfully connected to auks server pood006.int.34tech.io:12345 Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: local endpoint stream 7 informations request succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: remote endpoint stream 7 informations request succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: remote host is 192.168.3.46 Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: context initialization succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: connection authentication context initialisation succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: authentication context addrs set up succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: default kstream initialisation succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: kstream basic initialisation succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: ccache initialisation succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: client kstream initialisation succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: authentication succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: NAT traversal required, setting dummy addresses Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message encryption succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message transmission succeed : 616 bytes sended Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message reception succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message decryption succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message reception succeed : 4 bytes stored Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message encryption succeed Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_krb5_stream: message transmission succeed : 4 bytes sended Wed Aug 15 18:26:25 2018 [INFO4] [euid=67200010,pid=25262] auks_api: auks cred added using default file Auks API request succeed

When launching jobs using auks plugin, I get the following error on slurmd.log (one compute node):

[2018-08-15T18:20:28.159] [551] debug: spank: /etc/slurm/plugstack.conf:1: Loaded plugin auks.so [2018-08-15T18:20:28.159] [551] debug: SPANK: appending plugin option "auks" [2018-08-15T18:20:28.165] [551] error: spank-auks: unable to get user 67200010 cred : auks api : reply seems corrupted [2018-08-15T18:20:28.165] [551] debug2: spank: auks.so: init = -200302 [2018-08-15T18:20:28.166] [551] debug2: spank: auks.so: init_post_opt = 0 [2018-08-15T18:20:28.166] [551] debug2: After call to spank_init() ...... ...... [2018-08-15T18:31:47.328] [552.0] error: couldn't chdir to `/home/test/motorBike2-64p3': Permission denied: going to /tmp instead

As expected, is not able to write on kerberized NFS homedir. :'(

Also, trying to issue:

[alex@pood006 ~]$auks -vvvvvvvv -g -u 67200010 with no luck:

Wed Aug 15 18:43:53 2018 [INFO2] [euid=67200010,pid=27110] auks_engine: initializing engine from 'common' block of file /usr/local/etc/auks.conf Wed Aug 15 18:43:53 2018 [INFO2] [euid=67200010,pid=27110] auks_engine: initializing engine from 'api' block of file /usr/local/etc/auks.conf Wed Aug 15 18:43:53 2018 [INFO2] [euid=67200010,pid=27110] auks_engine: initializing engine from 'renewer' block of file /usr/local/etc/auks.conf Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon is 'pood006.int.34tech.io' Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon address is 'pood006.int.34tech.io' Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon port is 12345 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine primary daemon principal is host/pood006.int.34tech.io@INT.34TECH.IO Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon is 'localhost' Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon address is 'localhost' Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon port is 12345 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine secondary daemon principal is Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine logfile is /tmp/auksapi.log Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine loglevel is 9 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine debugfile is /tmp/auksapi.log Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine debuglevel is 9 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine retry number is 3 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine timeout is 10 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine delay is 3 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine NAT traversal mode is enabled Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_loglevel is 9 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer_debuglevel is 9 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer delay is 60 Wed Aug 15 18:43:53 2018 [INFO3] [euid=67200010,pid=27110] auks_engine: engine renewer min cred lifetime is 600 Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_api: starting retry 1 of 3 Wed Aug 15 18:43:53 2018 [INFO7] [euid=67200010,pid=27110] xstream: socket creation succeed Wed Aug 15 18:43:53 2018 [INFO7] [euid=67200010,pid=27110] xstream: socket non-blocking flag is now set Wed Aug 15 18:43:53 2018 [INFO7] [euid=67200010,pid=27110] xstream: connect (192.168.3.46:12345) succeed while polling Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_api: successfully connected to auks server pood006.int.34tech.io:12345 Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: local endpoint stream 4 informations request succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: remote endpoint stream 4 informations request succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: remote host is 192.168.3.46 Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: context initialization succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: connection authentication context initialisation succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: authentication context addrs set up succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: default kstream initialisation succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: kstream basic initialisation succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: ccache initialisation succeed Wed Aug 15 18:43:53 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: client kstream initialisation succeed Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: authentication succeed Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: NAT traversal required, setting dummy addresses Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message encryption succeed Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message transmission succeed : 8 bytes sended Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message reception succeed Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message decryption succeed Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message reception succeed : 4 bytes stored Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message encryption succeed Wed Aug 15 18:43:54 2018 [INFO4] [euid=67200010,pid=27110] auks_krb5_stream: message transmission succeed : 4 bytes sended Wed Aug 15 18:43:54 2018 [INFO3] [euid=67200010,pid=27110] auks_api: get request failed : bad reply type (21) Wed Aug 15 18:43:54 2018 [INFO3] [euid=67200010,pid=27110] auks_api: unable to unpack auks cred from reply : auks api : reply type is invalid Auks API request failed : auks api : reply seems corrupted [alex@pood006 ~]$ id uid=67200010(alex) gid=67200010(alex) groups=67200010(alex),67200000(admins),67200004(sudoers),67200008(vpnusers)

[alex@pood006 ~]$ ls -la /tmp/krb5cc* -rw------- 1 root root 1015 Aug 15 17:58 /tmp/krb5cc_0 -rw------- 1 root root 1015 Aug 15 17:47 /tmp/krb5cc_0_auks -rw------- 1 root root 1152 Aug 15 14:50 /tmp/krb5ccmachine_INT.34TECH.IO

[root@pood006 alex]# ls -al /var/lib/sss/db/ total 5668 drwx------. 2 sssd sssd 168 Aug 15 18:31 . drwxr-xr-x. 10 root root 120 May 24 13:36 .. -rw------- 1 root root 1613824 Aug 15 18:30 cache_int.34tech.io.ldb -rw------- 1 root root 1150 Aug 15 18:31 ccache_INT.34TECH.IO -rw-------. 1 root root 1286144 Jun 13 17:47 config.ldb -rw------- 1 root root 598 Aug 15 14:50 fast_ccache_INT.34TECH.IO -rw-------. 1 root root 1286144 May 24 13:37 sssd.ldb -rw------- 1 root root 1609728 Aug 15 18:43 timestamps_int.34tech.io.ldb

Could somebody give me a hand?

Note that compute nodes are centos + sssd + autofs + krb5 + nfs.

Thank you! Alex

hautreux commented 6 years ago

'auks -a' is a command that is commonly used by users to push their TGT to the auks server. Running that as root may work depending on the kerberos principal used at that moment. Most of the time, root is using the host principal of the form host/fqdn@realm. This principal is not automatically mapped to any users in krb5.conf so the uid to associate the TGT to in the auks server can not be found.

hautreux commented 6 years ago

'auks -vvvvvvvv -g -u 67200010' only works if the user privilege is associated to the principal. By default, the guest privilege is associated and you can only add credentials to the server, not get them. The spank plugin seems to have issues while contacting the auks server, you should check that the principal used by slurmd has the admin privilege in auksd too. Privileges are configured in the ACL file.

afont commented 6 years ago

Hi hautreux,

Ok, I understand it better now. Here's my auks.acl file:

rule { principal = ^host/int.34tech.io@INT.34TECH.IO$ ; host=; role = admin ; } rule { principal = ^[[:alnum:]]@INT.34TECH.IO$ ; host = * ; role = user ; }

Is it correct? I've reinstalled everything and installed from RPM's on slurm-manager, frontend node and 2 compute nodes.

Everything seems to work fine, all services up and running, except for auksdrenewer on slurm-manger, that is trying to connect to localhost:12345 even there's no present on auks.conf file (maybe defaults?).

Here's my auks.conf:

------------------------------------------------------------------------------

auks client and server configuration file

------------------------------------------------------------------------------

-

Common client/server elements

-

common {

Primary daemon configuration

PrimaryHost = "slurm-manager.int.34tech.io" ;

PrimaryAddress = "" ;

PrimaryPort = 12345 ; PrimaryPrincipal = "host/slurm-manager.int.34tech.io@INT.34TECH.IO" ;

Secondary daemon configuration

SecondaryHost = "auks2" ;

SecondaryAddress = "" ;

SecondaryPort = "12345" ;

SecondaryPrincipal = "host/auks2.myrealm.org@MYREALM.ORG" ;

Enable/Disable NAT traversal support (yes/no)

this value must be the same on every nodes

NAT = yes ;

max connection retries number

Retries = 3 ;

connection timeout

Timeout = 10 ;

delay in seconds between retries

Delay = 3 ;

}

-

API only elements

-

api {

log file and level

LogFile = "/tmp/auksapi.log" ; LogLevel = "9" ;

optional debug file and level

DebugFile = "/tmp/auksapi.log" ; DebugLevel = "9" ;

}

-

Auks daemon only elements

-

auksd {

Primary daemon configuration

PrimaryKeytab = "/etc/krb5.keytab" ;

Secondary daemon configuration

SecondaryKeytab = "/etc/krb5.keytab" ;

log file and level

LogFile = "/var/log/auksd.log" ; LogLevel = "1" ;

optional debug file and level

DebugFile = "/var/log/auksd.log" ; DebugLevel = "0" ;

directory in which daemons store the creds

CacheDir = "/var/cache/auks" ;

ACL file for cred repo access authorization rules

ACLFile = "/etc/auks/auks.acl" ;

default size of incoming requests queue

it grows up dynamically

QueueSize = 50 ;

default repository size (number fo creds)

it grows up dynamicaly

RepoSize = 500 ;

number of workers for incoming request processing

Workers = 10 ;

delay in seconds between 2 repository clean stages

CleanDelay = 300 ;

use kerberos replay cache system (slow down)

ReplayCache = yes ;

}

-

Auksd renewer only elements

-

renewer {

log file and level

LogFile = "/var/log/auksdrenewer.log" ; LogLevel = "1" ;

optional debug file and level

DebugFile = "/var/log/auksdrenewer.log" ; DebugLevel = "0" ;

delay between two renew loops

Delay = "60" ;

Min Lifetime for credentials to be renewed

This value is also used as the grace trigger to renew creds

MinLifeTime = "600" ;

}

I've tried also to submit jobs and run fine writing to NFS share /home directory, but I've tried to create a new user on IPA and when I submitted a job, it crashed saying permission denied on /home directory. Doing some testing, with this new test user, give the following output:

test4@fe01 ~]$ whoami test4 [test4@fe01 ~]$ id uid=67200023(test4) gid=67200023(test4) groups=67200023(test4),67200000(admins),67200004(sudoers),67200008(vpnusers) [test4@fe01 ~]$ klist Ticket cache: KEYRING:persistent:67200023:krb_ccache_P52GUs3 Default principal: test4@INT.34TECH.IO

Valid starting Expires Service principal 08/16/2018 16:38:02 08/17/2018 16:38:02 krbtgt/INT.34TECH.IO@INT.34TECH.IO

[test4@fe01 ~]$ srun klist Auks API request failed : krb5 cred : unable to read credential cache Auks API request failed : krb5 cred : unable to read credential cache klist: Credentials cache keyring 'persistent:67200023:67200023' not found srun: error: pood006: task 0: Exited with exit code 1

But the result is different if I first login and logout into the two compute nodes, through SSH, using this new user, then I try to submit the job again and it works perfectly? Is this behaviour expected? Note that if after that, I try to do the "srun klist" test procedure, I get the following results:

$klist Ticket cache: KEYRING:persistent:67200023:krb_ccache_P52GUs3 Default principal: test4@INT.34TECH.IO

Valid starting Expires Service principal 08/16/2018 16:38:02 08/17/2018 16:38:02 krbtgt/INT.34TECH.IO@INT.34TECH.IO

$srun klist Auks API request failed : krb5 cred : unable to read credential cache Auks API request failed : krb5 cred : unable to read credential cache Ticket cache: KEYRING:persistent:67200023:krb_ccache_V4aQgfL Default principal: test4@INT.34TECH.IO

Valid starting Expires Service principal 08/16/2018 16:46:13 08/17/2018 16:46:11 nfs/nas02.int.34tech.io@INT.34TECH.IO 08/16/2018 16:46:12 08/17/2018 16:46:11 krbtgt/INT.34TECH.IO@INT.34TECH.IO

Thank you for your help.

Regards, Alex

hautreux commented 6 years ago

Auks may not be fully compatible with kerberos keyring credential cache. The keyring logic was introduced in rhel7 and auks was created before that. Issue #11 was opened concerning that but I never spent time on trying to reproduce it so I can't confirm. You should increase the verbosity level of auksd and look at the journal to see if credentials add/get operations are successful or not for you.

If not, you should try to use the former credential cache logic (FILE:/...) to see if it helps make things work as expected (this requires to comment a line in your krb5.conf, as described in issue #11). It could be the case. This is the way I am using auks and have validated it.

Things may work after ssh to the compute nodes because the kerberos TGS required to access your NFS home directory is retrieved by the rpc.gssd service and pushed down to the kernel at that moment. Kernel cache timeout for TGS is very high by default which means that further accesses (including srun jobs) can reuse it transparently even after ssh disconnections. This must be the reason why it works for your new user after ssh.

hautreux commented 6 years ago

Concerning auksdrenewer, you should try to specify the PrimaryAddress conf parameter to ensure that a listening IP is used to contact the auks server.

afont commented 6 years ago

I figured it out, ok So when saying to try to use the former credential cache logic (FILE:/...), you mean using described in #23?

Thank you! Alex

hautreux commented 6 years ago

Information in #23 may be valuable for your issue too. RHEL is evolving and things around kerberos logic get different across versions even minors. I have not yet experimented with 7.5 but @robberteggermont is certainly right and what he propose sounds like a good workaround for what he saw, even if a little bit complicated to configure because of how auks works right now.

I was more thinking about the issue described in #11 in your case but depending on you OS you may need both.

afont commented 6 years ago

You should increase the verbosity level of auksd and look at the journal to see if credentials add/get operations are successful or not for you.

On slurm log shows:

9] [590.0] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm/uid_67200022/job_590/step_0' [2018-08-16T19:09:15.249] [590.0] debug2: Before call to spank_init() [2018-08-16T19:09:15.249] [590.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_slurmd_init' in the plugin [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_job_prolog' in the plugin [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_job_epilog' in the plugin [2018-08-16T19:09:15.253] [590.0] debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin [2018-08-16T19:09:15.253] [590.0] debug: spank: /etc/slurm/plugstack.conf:1: Loaded plugin auks.so [2018-08-16T19:09:15.253] [590.0] debug: SPANK: appending plugin option "auks" [2018-08-16T19:09:15.260] [590.0] error: spank-auks: unable to get user 67200022 cred : auks api : reply seems corrupted [2018-08-16T19:09:15.260] [590.0] debug2: spank: auks.so: init = -200302 [2018-08-16T19:09:15.260] [590.0] debug2: spank: auks.so: init_post_opt = 0 [2018-08-16T19:09:15.260] [590.0] debug2: After call to spank_init() [2018-08-16T19:09:15.260] [590.0] debug: mpi type = (null) [2018-08-16T19:09:15.260] [590.0] debug: mpi/none: slurmstepd prefork [2018-08-16T19:09:15.260] [590.0] debug3: cpu_str = 0x00010001 [2018-08-16T19:09:15.260] [590.0] debug4: eio: handling events for 1 objects

Thank you.

Alex

afont commented 6 years ago

I tried both #11 and #23 with no luck...

Alex

robberteggermont commented 6 years ago

What does the output of auksd show for the above dump request?

afont commented 6 years ago

auksd.log don't give any output when the job is launched to the cluster.

The output, for example, when I run:

$auks -g -u 67200013 -vvvv

Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : incoming socket 4 successfully dequeued Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: local endpoint stream 4 informations request succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: remote endpoint stream 4 informations request succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: remote host is 192.168.3.251 Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: context initialization succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: connection authentication context initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: authentication context addrs set up succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: default kstream initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: kstream basic initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: keytab initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: server kstream initialisation succeed Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : krb5 stream successfully initialized for socket 4 Tue Aug 21 15:47:24 2018 [INFO7] [euid=0,pid=26817] xqueue: dequeuing: queue is empty, waiting for element enqueuing... Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: authentication succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: connection authenticator successfully extract Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: client principal successfully stored Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: NAT traversal required, setting dummy addresses Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : authentication succeed on socket 4 (192.168.3.251) Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: principal name extraction succeed Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : alex@INT.34TECH.IO connected on socket 4 (192.168.3.251) Tue Aug 21 15:47:24 2018 [INFO5] [euid=0,pid=26817] auks_acl: alex@INT.34TECH.IO doesn't match rule regexp '^host/int.34tech.io@INT.34TECH.IO$' Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_acl: rule[0] principal check failed : rule is ^host/int.34tech.io@INT.34TECH.IO$, request is alex@INT.34TECH.IO Tue Aug 21 15:47:24 2018 [INFO5] [euid=0,pid=26817] auks_acl: alex@INT.34TECH.IO doesn't match rule regexp '^nfs/int.34tech.io@INT.34TECH.IO$' Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_acl: rule[1] principal check failed : rule is ^nfs/int.34tech.io@INT.34TECH.IO$, request is alex@INT.34TECH.IO Tue Aug 21 15:47:24 2018 [INFO5] [euid=0,pid=26817] auks_acl: alex@INT.34TECH.IO matches rule regexp '^[[:alnum:]]@INT.34TECH.IO$' Tue Aug 21 15:47:24 2018 [INFO5] [euid=0,pid=26817] auks_acl: matches rule host * Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_acl: rule[2] matches, associated role is 3 (admin) Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : alex@INT.34TECH.IO role is admin Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message reception succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message decryption succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message reception succeed : 8 bytes stored Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] worker[3] : marshalled request reception succeed Tue Aug 21 15:47:24 2018 [DBUG7] xlibrary: xlibrary_get_item : entering Tue Aug 21 15:47:24 2018 [DBUG8] xlibrary: xlibrary_get_item_nolock : entering Tue Aug 21 15:47:24 2018 [DBUG9] xfreelist: xfreelist_extract_item : entering Tue Aug 21 15:47:24 2018 [INFO9] [euid=0,pid=26817] xfreelist: item '1b023b0' successfully extracted from freelist '96d36430' Tue Aug 21 15:47:24 2018 [DBUG9] xfreelist: xfreelist_extract_item : exiting with status 0 Tue Aug 21 15:47:24 2018 [INFO7] [euid=0,pid=26817] xlibrary: get_item: no item referenced by '67200013' in tree Tue Aug 21 15:47:24 2018 [DBUG9] xfreelist: xfreelist_release_item : entering Tue Aug 21 15:47:24 2018 [INFO9] [euid=0,pid=26817] xfreelist: item '1b023b0' successfully released to freelist '96d36430' Tue Aug 21 15:47:24 2018 [DBUG9] xfreelist: xfreelist_release_item : exiting with status 0 Tue Aug 21 15:47:24 2018 [DBUG8] xlibrary: xlibrary_get_item_nolock : exiting with status -201 Tue Aug 21 15:47:24 2018 [DBUG7] xlibrary: xlibrary_get_item : exiting with status -201 Tue Aug 21 15:47:24 2018 [INFO5] [euid=0,pid=26817] auks_repo: get : unable to find '67200013' cred in the library Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : unable to get '67200013' cred from repo : auks library : cred with matching uid not found Tue Aug 21 15:47:24 2018 [INFO2] [euid=0,pid=26817] worker[3] : alex@INT.34TECH.IO from 192.168.3.251 : get request failed : auks library : cred with matching uid not found Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message encryption succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message transmission succeed : 4 bytes sended Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : alex@INT.34TECH.IO from 192.168.3.251 : reply successfully sent Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message reception succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message decryption succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] auks_krb5_stream: message reception succeed : 4 bytes stored Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=26817] worker[3] : marshalled request reception succeed Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : alex@INT.34TECH.IO from 192.168.3.251 : close request succeed Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=26817] worker[3] : incoming socket 4 processing succeed Tue Aug 21 15:47:24 2018 [DBUG7] xqueue: xqueue_dequeue : entering Tue Aug 21 15:47:24 2018 [INFO7] [euid=0,pid=26817] xqueue: dequeuing: queue is empty, waiting for element enqueuing...

And this is the output of auks command:

[root@slurm-manager alex]# auks -g -u 67200013 -vvv Tue Aug 21 15:47:24 2018 [INFO2] [euid=0,pid=27608] auks_engine: initializing engine from 'common' block of file /etc/auks/auks.conf Tue Aug 21 15:47:24 2018 [INFO2] [euid=0,pid=27608] auks_engine: initializing engine from 'api' block of file /etc/auks/auks.conf Tue Aug 21 15:47:24 2018 [INFO2] [euid=0,pid=27608] auks_engine: initializing engine from 'renewer' block of file /etc/auks/auks.conf Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine primary daemon is 'slurm-manager.int.34tech.io' Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine primary daemon address is '192.168.3.251' Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine primary daemon port is 12345 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine primary daemon principal is host/slurm-manager.int.34tech.io@INT.34TECH.IO Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine secondary daemon is 'localhost' Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine secondary daemon address is 'localhost' Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine secondary daemon port is 12345 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine secondary daemon principal is Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine logfile is /tmp/auksapi.log Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine loglevel is 9 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine debugfile is /tmp/auksapi.log Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine debuglevel is 9 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine retry number is 3 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine timeout is 10 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine delay is 3 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine NAT traversal mode is enabled Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine renewer_logfile is /var/log/auksdrenewer.log Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine renewer_loglevel is 9 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine renewer_debugfile is /var/log/auksdrenewer.log Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine renewer_debuglevel is 9 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine renewer delay is 60 Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_engine: engine renewer min cred lifetime is 600 Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_api: starting retry 1 of 3 Tue Aug 21 15:47:24 2018 [INFO7] [euid=0,pid=27608] xstream: socket creation succeed Tue Aug 21 15:47:24 2018 [INFO7] [euid=0,pid=27608] xstream: socket non-blocking flag is now set Tue Aug 21 15:47:24 2018 [INFO7] [euid=0,pid=27608] xstream: connect (192.168.3.251:12345) succeed while polling Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_api: successfully connected to auks server 192.168.3.251:12345 Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: local endpoint stream 4 informations request succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: remote endpoint stream 4 informations request succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: remote host is 192.168.3.251 Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: context initialization succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: connection authentication context initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: authentication context addrs set up succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: default kstream initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: kstream basic initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: ccache initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: client kstream initialisation succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: authentication succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: NAT traversal required, setting dummy addresses Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message encryption succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message transmission succeed : 8 bytes sended Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message reception succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message decryption succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message reception succeed : 4 bytes stored Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message encryption succeed Tue Aug 21 15:47:24 2018 [INFO4] [euid=0,pid=27608] auks_krb5_stream: message transmission succeed : 4 bytes sended Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_api: get request failed : bad reply type (21) Tue Aug 21 15:47:24 2018 [INFO3] [euid=0,pid=27608] auks_api: unable to unpack auks cred from reply : auks api : reply type is invalid Auks API request failed : auks api : reply seems corrupted

I've noticed also that aukspriv is having issues, after using the export option on aukspriv:

Aug 21 17:32:08 slurm-manager aukspriv[4542]: unable to get ccache for FILE:/etc/krb5.keytab using ktfile /etc/krb5.keytab : kinit: Keytab contains no suitable keys for FILE:/etc/krb5.keytab@INT.34TECH.IO while getting initial credentials

Thank you!

Alex

afont commented 6 years ago

And when running jobs, they fail immediately with this error on log:

[2018-08-21T17:51:32.458] [613] debug3: xcgroup_set_param: parameter 'notify_on_release' set to '0' for '/sys/fs/cgroup/freezer/slurm/uid_67200021/job_613/step_batch' [2018-08-21T17:51:32.458] [613] debug2: Before call to spank_init() [2018-08-21T17:51:32.458] [613] debug: spank: opening plugin stack /etc/slurm/plugstack.conf [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_slurmd_init' in the plugin [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_job_prolog' in the plugin [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_job_epilog' in the plugin [2018-08-21T17:51:32.462] [613] debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin [2018-08-21T17:51:32.462] [613] debug: spank: /etc/slurm/plugstack.conf:1: Loaded plugin auks.so [2018-08-21T17:51:32.462] [613] debug: SPANK: appending plugin option "auks" [2018-08-21T17:51:32.468] [613] error: spank-auks: unable to get user 67200021 cred : auks api : reply seems corrupted [2018-08-21T17:51:32.468] [613] debug2: spank: auks.so: init = -200302 [2018-08-21T17:51:32.468] [613] debug2: spank: auks.so: init_post_opt = 0 [2018-08-21T17:51:32.468] [613] debug2: After call to spank_init() [2018-08-21T17:51:32.468] [613] debug: mpi type = (null) [2018-08-21T17:51:32.468] [613] debug: mpi/none: slurmstepd prefork [2018-08-21T17:51:32.468] [613] debug4: eio: handling events for 1 objects [2018-08-21T17:51:32.468] [613] debug3: Called _msg_socket_readable [2018-08-21T17:51:32.468] [613] debug2: Using gid list sent by slurmd [2018-08-21T17:51:32.468] [613] debug4: eio: handling events for 1 objects [2018-08-21T17:51:32.468] [613] debug3: Called _msg_socket_readable [2018-08-21T17:51:32.468] [613] debug2: Entering _setup_normal_io [2018-08-21T17:51:32.468] [613] debug4: eio: handling events for 1 objects [2018-08-21T17:51:32.468] [613] debug3: Called _msg_socket_readable [2018-08-21T17:51:32.468] [613] debug2: Using gid list sent by slurmd [2018-08-21T17:51:32.468] [613] debug4: eio: handling events for 1 objects [2018-08-21T17:51:32.469] [613] debug3: Called _msg_socket_readable [2018-08-21T17:51:32.469] [613] debug4: eio: handling events for 1 objects [2018-08-21T17:51:32.469] [613] debug3: Called _msg_socket_readable [2018-08-21T17:51:32.469] [613] debug5: stdin file name = /dev/null [2018-08-21T17:51:32.469] [613] debug5: stdout file name = /home/test2/motorBike2-64p3/slurm-613.out [2018-08-21T17:51:32.528] [613] error: Could not open stdout file /home/test2/motorBike2-64p3/slurm-613.out: Key has expired [2018-08-21T17:51:32.528] [613] debug4: eio: handling events for 1 objects [2018-08-21T17:51:32.528] [613] debug3: Called _msg_socket_readable [2018-08-21T17:51:32.528] [613] debug2: Leaving _setup_normal_io [2018-08-21T17:51:32.528] [613] error: IO setup failed: Key has expired [2018-08-21T17:51:32.529] [613] debug2: step_terminate_monitor will run

afont commented 6 years ago

Hi,

Finally got it working! Reconfiguring everything from scratch, adjusting acl and following the #23 instructions, got it running perfectly.

Thank you all for the help.

Regards, Alex