hautreux / auks

Kerberos credential support for batch environments

Auks API request failed : krb5 cred : unable to read credential cache #8

Closed: sreedharmanchu closed this issue 8 years ago

sreedharmanchu commented 8 years ago

Hi,

I am trying to make Slurm work with auks. Right now I am testing it on 3 CentOS 7.1 VMs. I am using the first one as a management node that runs all three auks components (auksd, aukspriv and auksdrenewer). All three boxes are compute and login nodes as well.

When I run srun /bin/hostname as a regular user, it says it is unable to read the credential cache. Right now, as shown below, permissions on /var/cache/auks are 700. I even changed them to 755 and the error is the same. I'm thinking some other user (is it the slurm user?) is trying to read these files and is rightly denied. I think I'm missing something simple here. Can you take a look? I really appreciate any help.

It does print the hostname, but I am not sure whether I'm supposed to see this error. I followed your howto article to create the configuration. I am not entirely sure whether my auks.acl is correct either.

It also complained "unable to parse configuration file", so I had to create a symlink to /etc/auks/auks.conf in /usr/local/etc. I saw another issue stating that this was fixed in the latest commit; yesterday I downloaded the zip file and built RPMs from it.

I do have the latest version of Slurm:

root@slurmdev1:/u/sreedhar$ srun -V
slurm 15.08.6

My configuration looks like this (I replaced the real names):

root@slurmdev1:/u/sreedhar$ rpm -qa | grep auks
auks-devel-0.4.4-1.el7.centos.x86_64
auks-slurm-0.4.4-1.el7.centos.x86_64
auks-0.4.4-1.el7.centos.x86_64

root@slurmdev1:/u/sreedhar$ cat /etc/auks/auks.conf

#------------------------------------------------------------------------------
# auks client and server configuration file
#------------------------------------------------------------------------------

#-
# Common client/server elements
#-
common {

    # Primary daemon configuration
    PrimaryHost        = "slurmdev1" ;
    PrimaryAddress     = "" ;
    PrimaryPort        = 12345 ;
    PrimaryPrincipal   = "host/slurmdev1.realm.a@REALM.A" ;

    # Secondary daemon configuration
    SecondaryHost      = "auks2" ;
    SecondaryAddress   = "" ;
    SecondaryPort      = "12345" ;
    SecondaryPrincipal = "host/auks2.myrealm.org@MYREALM.ORG" ;

    # Enable/Disable NAT traversal support (yes/no)
    # this value must be the same on every nodes
    NAT = no ;

    # max connection retries number
    Retries = 3 ;

    # connection timeout
    Timeout = 10 ;

    # delay in seconds between retries
    Delay = 3 ;

}

#-
# API only elements
#-
api {

    # log file and level
    LogFile  = /tmp/auksapi.log ;
    LogLevel = 3 ;

    # optional debug file and level
    DebugFile  = /tmp/auksapi.log ;
    DebugLevel = 3 ;

}

#-
# Auks daemon only elements
#-
auksd {

    # Primary daemon configuration
    PrimaryKeytab = "/etc/krb5.keytab" ;

    # Secondary daemon configuration
    SecondaryKeytab = "/etc/krb5.keytab" ;

    # log file and level
    LogFile  = "/var/log/auksd.log" ;
    LogLevel = "2" ;

    # optional debug file and level
    DebugFile  = "/var/log/auksd.log" ;
    DebugLevel = "0" ;

    # directory in which daemons store the creds
    CacheDir = "/var/cache/auks" ;

    # ACL file for cred repo access authorization rules
    ACLFile = "/etc/auks/auks.acl" ;

    # default size of incoming requests queue
    # it grows up dynamically
    QueueSize = 500 ;

    # default repository size (number fo creds)
    # it grows up dynamicaly
    RepoSize = 1000 ;

    # number of workers for incoming request processing
    Workers = 1000 ;

    # delay in seconds between 2 repository clean stages
    CleanDelay = 300 ;

    # use kerberos replay cache system (slow down)
    ReplayCache = no ;

}

#-
# Auksd renewer only elements
#-
renewer {

    # log file and level
    LogFile  = "/var/log/auksdrenewer.log" ;
    LogLevel = "1" ;

    # optional debug file and level
    DebugFile  = "/var/log/auksdrenewer.log" ;
    DebugLevel = "0" ;

    # delay between two renew loops
    Delay = "60" ;

    # Min Lifetime for credentials to be renewed
    # This value is also used as the grace trigger to renew creds
    MinLifeTime = "600" ;

}

root@slurmdev1:/u/sreedhar$ cat /etc/auks/auks.acl

#-------------------------------------------------------------------------------
rule { principal = ^host/slurmdev[1-3].realm.a@REALM.A$ ; host = * ; role = admin ; }
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
rule { principal = ^[[:alnum:]]*@REALM.A$ ; host = * ; role = user ; }
#-------------------------------------------------------------------------------

root@caslurmdev1:/u/sreedhar$ cat /etc/slurm/plugstack.conf
include /etc/slurm/plugstack.conf.d/*.conf
root@caslurmdev1:/u/sreedhar$ cat /etc/slurm/plugstack.conf.d/auks.conf
optional /usr/lib64/slurm/auks.so default=enabled spankstackcred=yes minimum_uid=1024

root@caslurmdev1:/u/sreedhar$ auks -p
Auks API request succeed
root@caslurmdev1:/u/sreedhar$ exit
exit

sreedhar@slurmdev1:~$ auks -p
Auks API init failed : unable to parse configuration file
sreedhar@slurmdev1:~$ auks -vvv -p
Thu Jan 21 11:26:22 2016 [INFO2] [euid=564800185,pid=53870] auks_engine: unable to parse configuration file /usr/local/etc/auks.conf : No such file or directory
Auks API init failed : unable to parse configuration file

sreedhar@slurmdev1:~$ sudo bash
root@slurmdev1:/u/sreedhar$ ln -s /etc/auks/auks.conf /usr/local/etc/auks.conf
root@slurmdev1:/u/sreedhar$ exit
exit

sreedhar@slurmdev1:~$ auks -p
Auks API request succeed

sreedhar@slurmdev1:~$ srun /bin/hostname
Auks API request failed : krb5 cred : unable to read credential cache
slurmdev1.realm.a

sreedhar@slurmdev1:~$ sudo bash
root@slurmdev1:/u/sreedhar$ ls -ld /var/cache/auks
drwx------ 2 root root 71 Jan 21 10:01 /var/cache/auks
root@slurmdev1:/u/sreedhar$ ls -l /var/cache/auks/*
-rw------- 1 root root 1.3K Jan 19 11:10 /var/cache/auks/aukscc_16180
-rw------- 1 root root 1.2K Jan 21 10:01 /var/cache/auks/aukscc_564800185
-rw------- 1 root root 1.2K Jan 20 15:32 /var/cache/auks/aukscc_564800186
root@slurmdev1:/u/sreedhar$ srun /bin/hostname
slurmdev1.realm.a
root@slurmdev1:/u/sreedhar$

sreedhar@slurmdev1:~$ srun klist
Auks API request failed : krb5 cred : unable to read credential cache
Ticket cache: FILE:/tmp/krb5cc_564800185_DMViMeYf6S
Default principal: sreedhar@realm.a

Valid starting       Expires              Service principal
01/20/2016 17:40:11  01/25/2016 17:40:08  krbtgt/REALM.A@REALM.A
        renew until 01/27/2016 17:40:08
01/21/2016 11:09:55  01/25/2016 17:40:08  host/slurmdev1.realm.a@REALM.A
        renew until 01/27/2016 17:40:08
sreedhar@slurmdev1:~$

hautreux commented 8 years ago

Do you have the aukspriv service started on your nodes?


sreedharmanchu commented 8 years ago

Hi,

Thank you for your prompt response. Yes, I believe I have aukspriv running.

Another thing I forgot to add in my last post: the auksd and auksdrenewer logs are not created anymore with this version. With 0.4.0-1 the logs worked fine; now they're not created at all. I restarted the services many times without success.

root@slurmdev1:/u/sreedhar$ systemctl status aukspriv
aukspriv.service - Auks ccache from keytab scripted daemon
   Loaded: loaded (/usr/lib/systemd/system/aukspriv.service; enabled)
   Active: active (running) since Thu 2016-01-21 08:48:34 EST; 3h 37min ago
  Process: 50700 ExecStart=/usr/sbin/aukspriv $AUKSPRIV_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 50705 (aukspriv)
   CGroup: /system.slice/aukspriv.service
           ├─50705 /bin/bash /usr/sbin/aukspriv
           └─50718 sleep 35000

Jan 21 08:48:34 slurmdev1.realm.a systemd[1]: Starting Auks ccache from keytab scripted daemon...
Jan 21 08:48:34 caslurmdev1.realm.a systemd[1]: Started Auks ccache from keytab scripted daemon.
root@slurmdev1:/u/sreedhar$ ps aux | grep auks
root  50705  0.0  0.0  115212  1572 ?      S    08:48  0:00 /bin/bash /usr/sbin/aukspriv
root  50723  0.0  0.1 2128244 12548 ?      Ssl  08:48  0:00 /usr/sbin/auksd -F
root  51729  0.0  0.0   33340  1968 ?      Ss   08:48  0:00 /usr/sbin/auksdrenewer -F
root  54779  0.0  0.0  112640   940 pts/2  S+   12:26  0:00 grep auks
root@slurmdev1:/u/sreedhar$

Sreedhar.

hautreux commented 8 years ago

They are now managed using systemd/journalctl. Use 'journalctl -u auksd -f' or something similar for the other components.
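
For reference, a minimal sketch of the equivalent commands for the other components (assuming the unit names match the service names shown elsewhere in this thread):

journalctl -u auksd -f           # auks daemon
journalctl -u auksdrenewer -f    # credential renewer
journalctl -u aukspriv -f        # per-node ccache helper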

Check that aukspriv is running and look at the slurmd log for auks errors.


sreedharmanchu commented 8 years ago

Thank you for that tip, that helps. Here is what I see in slurmd.log, auksapi.log, and the auksd and auksdrenewer journals.

I see "spank-auks: unable to get user xxxx cred: auks api : reply seems corrupted".

Do you think there is something wrong with my configuration? Thanks again.

auksapi.log

Thu Jan 21 12:48:37 2016 [INFO3] [euid=0,pid=55096] auks_api: get request processing failed : auks api : connection failed

Thu Jan 21 12:48:37 2016 [INFO3] [euid=0,pid=55096] auks_api: unable to unpack auks cred from reply : auks api : request processing failed

auksd.log

Jan 21 12:48:31 slurmdev1.realm.a auksd[50723]: Thu Jan 21 12:48:31 2016 [INFO2] [euid=0,pid=50723] worker[629] : sreedhar@REALM.A from 10.250.17.76 : add request succeed
Jan 21 12:48:31 slurmdev1.realm.a auksd[50723]: Thu Jan 21 12:48:31 2016 [INFO2] [euid=0,pid=50723] worker[631] : authentication failed on socket 7 (10.250.17.76) : krb5 stream : recvauth stage failed (server side)
Jan 21 12:48:34 slurmdev1.realm.a auksd[50723]: Thu Jan 21 12:48:34 2016 [INFO2] [euid=0,pid=50723] worker[633] : authentication failed on socket 6 (10.250.17.76) : krb5 stream : recvauth stage failed (server side)
Jan 21 12:48:37 slurmdev1.realm.a auksd[50723]: Thu Jan 21 12:48:37 2016 [INFO2] [euid=0,pid=50723] worker[635] : authentication failed on socket 7 (10.250.17.76) : krb5 stream : recvauth stage failed (server side)
Jan 21 12:48:42 slurmdev1.realm.a auksd[50723]: Thu Jan 21 12:48:42 2016 [INFO2] [euid=0,pid=50723] worker[0] : auks cred repo cleaned in ~0s (0 creds removed)

Jan 21 12:48:52 slurmdev1.realm.a auksd[50723]: Thu Jan 21 12:48:52 2016 [INFO2] [euid=0,pid=50723] worker[637] : host/slurmdev1.realm.a@REALM.A from 10.250.17.76 : dump request succeed

auksdrenewer.log

Jan 21 12:48:52 slurmdev1.realm.a auksdrenewer[51729]: Thu Jan 21 12:48:52 2016 [INFO1] [euid=0,pid=51729] renewer: 3 creds dumped
Jan 21 12:49:52 slurmdev1.realm.a auksdrenewer[51729]: Thu Jan 21 12:49:52 2016 [INFO1] [euid=0,pid=51729] renewer: 3 creds dumped
Jan 21 12:50:52 slurmdev1.realm.a auksdrenewer[51729]: Thu Jan 21 12:50:52 2016 [INFO1] [euid=0,pid=51729] renewer: 3 creds dumped

Jan 21 12:51:52 slurmdev1.realm.a auksdrenewer[51729]: Thu Jan 21 12:51:52 2016 [INFO1] [euid=0,pid=51729] renewer: 3 creds dumped

slurmd.log

[2016-01-21T12:48:31.778] debug2: got this type of message 6001 [2016-01-21T12:48:31.778] debug2: Processing RPC: REQUEST_LAUNCH_TASKS [2016-01-21T12:48:31.778] launch task 103.0 request from 564800185.564800185@x.x.x.x (port 7638) [2016-01-21T12:48:31.778] debug: Checking credential with 276 bytes of sig data [2016-01-21T12:48:31.778] debug: task_p_slurmd_launchrequest: 103.0 0 [2016-01-21T12:48:31.779] debug: Calling /opt/slurm/sbin/slurmstepd spank prolog [2016-01-21T12:48:31.780] Reading slurm.conf file: /opt/slurm/etc/slurm.conf [2016-01-21T12:48:31.781] Running spank/prolog for jobid [103] uid [564800185] [2016-01-21T12:48:31.781] spank: opening plugin stack /etc/slurm/plugstack.conf [2016-01-21T12:48:31.781] /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/.conf" [2016-01-21T12:48:31.781] spank: opening plugin stack /etc/slurm/plugstack.conf.d/auks.conf [2016-01-21T12:48:31.782] spank: /usr/lib64/slurm/auks.so: no callbacks in this context [2016-01-21T12:48:31.789] _run_prolog: run job script took usec=10435 [2016-01-21T12:48:31.789] _run_prolog: prolog with lock for job 103 ran for 0 seconds [2016-01-21T12:48:31.792] debug level is 6. [2016-01-21T12:48:31.792] switch NONE plugin loaded [2016-01-21T12:48:31.792] setup for a launch_task [2016-01-21T12:48:31.792] AcctGatherProfile NONE plugin loaded [2016-01-21T12:48:31.792] AcctGatherEnergy NONE plugin loaded [2016-01-21T12:48:31.792] AcctGatherInfiniband NONE plugin loaded [2016-01-21T12:48:31.792] AcctGatherFilesystem NONE plugin loaded [2016-01-21T12:48:31.792] No acct_gather.conf file (/opt/slurm/etc/acct_gather.conf) [2016-01-21T12:48:31.792] Job accounting gather LINUX plugin loaded [2016-01-21T12:48:31.793] [103.0] profile signalling type Task [2016-01-21T12:48:31.793] [103.0] Message thread started pid = 55096 [2016-01-21T12:48:31.793] [103.0] task NONE plugin loaded [2016-01-21T12:48:31.793] debug: task_p_slurmd_reserve_resources: 103 0 [2016-01-21T12:48:31.793] [103.0] Checkpoint plugin loaded: checkpoint/none [2016-01-21T12:48:31.793] [103.0] mpi type = none [2016-01-21T12:48:31.793] [103.0] Before call to spankinit() [2016-01-21T12:48:31.793] [103.0] spank: opening plugin stack /etc/slurm/plugstack.conf [2016-01-21T12:48:31.793] [103.0] /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/.conf" [2016-01-21T12:48:31.793] [103.0] spank: opening plugin stack /etc/slurm/plugstack.conf.d/auks.conf [2016-01-21T12:48:31.794] [103.0] spank: /etc/slurm/plugstack.conf.d/auks.conf:40: Loaded plugin auks.so [2016-01-21T12:48:31.794] [103.0] SPANK: appending plugin option "auks" [2016-01-21T12:48:37.796] [103.0] spank-auks: unable to get user 564800185 cred : auks api : reply seems corrupted [2016-01-21T12:48:37.796] [103.0] spank: auks.so: init = -200302 [2016-01-21T12:48:37.796] [103.0] spank: auks.so: init_post_opt = 0 [2016-01-21T12:48:37.796] [103.0] After call to spank_init() [2016-01-21T12:48:37.796] [103.0] mpi type = (null) [2016-01-21T12:48:37.796] [103.0] mpi/none: slurmstepd prefork [2016-01-21T12:48:37.796] [103.0] Uncached user/gid: sreedhar/564800185 [2016-01-21T12:48:37.797] [103.0] Entering _setup_normal_io [2016-01-21T12:48:37.797] [103.0] Uncached user/gid: sreedhar/564800185 [2016-01-21T12:48:37.798] [103.0] Entering io_init_msg_write_to_fd [2016-01-21T12:48:37.798] [103.0] msg->nodeid = 0 [2016-01-21T12:48:37.798] [103.0] Leaving io_init_msg_write_to_fd [2016-01-21T12:48:37.798] [103.0] Leaving _setup_normal_io [2016-01-21T12:48:37.798] [103.0] debug level = 2 
[2016-01-21T12:48:37.799] [103.0] IO handler started pid=55096 [2016-01-21T12:48:37.799] [103.0] Uncached user/gid: sreedhar/564800185 [2016-01-21T12:48:37.800] [103.0] spank-auks: credential renewer launched (pid=55102) [2016-01-21T12:48:37.800] [103.0] spank: auks.so: user_init = 0 [2016-01-21T12:48:37.800] [103.0] task 0 (55103) started 2016-01-21T12:48:37 [2016-01-21T12:48:37.800] [103.0] task_p_pre_launch_priv: 103.0 [2016-01-21T12:48:37.800] [103.0] Uncached user/gid: sreedhar/564800185 [2016-01-21T12:48:37.801] [103.0] adding task 0 pid 55103 on node 0 to jobacct [2016-01-21T12:48:37.804] [103.0] jag_common_poll_data: 55103 mem size 1508 302844 time 0.000000(0+0) [2016-01-21T12:48:37.804] [103.0] _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq [2016-01-21T12:48:37.804] [103.0] _get_sys_interface_freq_line: filename = /proc/cpuinfo [2016-01-21T12:48:37.804] [103.0] cpunfo_frequency=2299 [2016-01-21T12:48:37.804] [103.0] jag_common_poll_data: Task average frequency = 2299 pid 55103 mem size 1508 302844 time 0.000000(0+0) [2016-01-21T12:48:37.804] [103.0] energycounted = 0 [2016-01-21T12:48:37.804] [103.0] getjoules_task energy = 0 [2016-01-21T12:48:37.804] [103.0] job_container none plugin loaded [2016-01-21T12:48:37.804] [103.0] read_slurm_cgroup_conf: No cgroup.conf file (/opt/slurm/etc/cgroup.conf) [2016-01-21T12:48:37.804] [103.0] unable to get cgroup '(null)/cpuset' entry '(null)/cpuset/system' properties: No such file or directory [2016-01-21T12:48:37.804] [103.0] unable to get cgroup '(null)/memory' entry '(null)/memory/system' properties: No such file or directory [2016-01-21T12:48:37.804] [103.0] Sending launch resp rc=0 [2016-01-21T12:48:37.804] [103.0] auth plugin for Munge (http://code.google.com/p/munge/) loaded [2016-01-21T12:48:37.805] [103.0] mpi type = (null) [2016-01-21T12:48:37.805] [103.0] Using mpi/none [2016-01-21T12:48:37.805] [103.0] task_p_pre_launch: 103.0, task 0 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_STACK no change in value: 8388608 [2016-01-21T12:48:37.805] [103.0] _set_limit: RLIMIT_CORE : max:inf cur:0 req:1024000000 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_CORE succeeded [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615 [2016-01-21T12:48:37.805] [103.0] _set_limit: RLIMIT_NPROC : max:31192 cur:31192 req:4096 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_NPROC succeeded [2016-01-21T12:48:37.805] [103.0] _set_limit: RLIMIT_NOFILE : max:65535 cur:65535 req:16384 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_NOFILE succeeded [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 18446744073709551615 [2016-01-21T12:48:37.805] [103.0] _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615 [2016-01-21T12:48:37.805] debug2: got this type of message 6004 [2016-01-21T12:48:37.805] debug2: Processing RPC: REQUEST_SIGNAL_TASKS [2016-01-21T12:48:37.805] debug: _rpc_signal_tasks: sending signal 995 to step 103.0 flag 0 [2016-01-21T12:48:37.806] [103.0] Handling 
REQUEST_STEP_UID [2016-01-21T12:48:37.806] [103.0] Handling REQUEST_SIGNAL_CONTAINER [2016-01-21T12:48:37.806] [103.0] _handle_signal_container for step=103.0 uid=0 signal=995 [2016-01-21T12:48:37.806] [103.0] getjoules_task energy = 0 [2016-01-21T12:48:37.806] [103.0] pid(55102) not being watched in jobacct! [2016-01-21T12:48:37.908] [103.0] getjoules_task energy = 0 [2016-01-21T12:48:37.908] [103.0] removing task 0 pid 55103 from jobacct [2016-01-21T12:48:37.908] [103.0] task 0 (55103) exited with exit code 0. [2016-01-21T12:48:37.908] [103.0] spank-auks: all tasks exited, killing credential renewer (pid=55102) [2016-01-21T12:48:37.908] [103.0] spank: auks.so: task_exit = 0 [2016-01-21T12:48:37.908] [103.0] task_p_post_term: 103.0, task 0 [2016-01-21T12:48:37.908] [103.0] No child processes [2016-01-21T12:48:37.911] [103.0] Sending SIGKILL to pgid 55096 [2016-01-21T12:48:37.911] [103.0] Waiting for IO [2016-01-21T12:48:37.911] [103.0] Closing debug channel [2016-01-21T12:48:37.911] [103.0] IO handler exited, rc=0 [2016-01-21T12:48:37.911] [103.0] Aggregated 1 task exit messages [2016-01-21T12:48:37.911] [103.0] Before call to spank_fini() [2016-01-21T12:48:37.911] [103.0] spank: auks.so: exit = 0 [2016-01-21T12:48:37.911] [103.0] After call to spank_fini() [2016-01-21T12:48:37.911] [103.0] Rank 0 has no children slurmstepd [2016-01-21T12:48:37.911] [103.0] _one_step_complete_msg: first=0, last=0 [2016-01-21T12:48:37.912] [103.0] false, shutdown [2016-01-21T12:48:37.912] [103.0] Message thread exited [2016-01-21T12:48:37.912] [103.0] done with job [2016-01-21T12:48:37.913] debug2: got this type of message 6011 [2016-01-21T12:48:37.914] debug2: Processing RPC: REQUEST_TERMINATE_JOB [2016-01-21T12:48:37.914] debug: _rpc_terminate_job, uid = 2000 [2016-01-21T12:48:37.914] debug: task_p_slurmd_release_resources: 103 [2016-01-21T12:48:37.914] debug: credential for job 103 revoked [2016-01-21T12:48:37.914] debug2: No steps in jobid 103 to send signal 18 [2016-01-21T12:48:37.914] debug2: No steps in jobid 103 to send signal 15 [2016-01-21T12:48:37.914] debug2: set revoke expiration for jobid 103 to 1453398637 UTS [2016-01-21T12:48:37.915] debug: Waiting for job 103's prolog to complete [2016-01-21T12:48:37.915] debug: Finished wait for job 103's prolog to complete [2016-01-21T12:48:37.915] debug: Calling /opt/slurm/sbin/slurmstepd spank epilog [2016-01-21T12:48:37.916] Reading slurm.conf file: /opt/slurm/etc/slurm.conf [2016-01-21T12:48:37.917] Running spank/epilog for jobid [103] uid [564800185] [2016-01-21T12:48:37.917] spank: opening plugin stack /etc/slurm/plugstack.conf [2016-01-21T12:48:37.917] /etc/slurm/plugstack.conf: 1: include "/etc/slurm/plugstack.conf.d/*.conf" [2016-01-21T12:48:37.917] spank: opening plugin stack /etc/slurm/plugstack.conf.d/auks.conf [2016-01-21T12:48:37.918] spank: /usr/lib64/slurm/auks.so: no callbacks in this context [2016-01-21T12:48:37.925] debug: completed epilog for jobid 103 [2016-01-21T12:48:37.925] debug: Job 103: sent epilog complete msg: rc = 0

hautreux commented 8 years ago

slurmd/slurmstepd is not able to properly authenticate itself against the auksd daemon using Kerberos. This must be because there is no valid krb5 credential in root's environment on the compute node(s). Ensuring that is the aim of the aukspriv service. Please run, as root, a 'klist -fna' to see if you have a valid credential, and if so run a 'kvno host/slurmdev1.realm.a@REALM.A' (replace with what is meaningful here) to check that you are able to get a ticket for the auksd principal. Please give me the output of these commands too.

sreedharmanchu commented 8 years ago

Thank you for the explanation. Here is the output for the commands you suggested.

root@slurmdev1:~$ klist -fna
Ticket cache: FILE:/tmp/krb5cc_564800185_ahBeIQ
Default principal: sreedhar@REALM.A

Valid starting       Expires              Service principal
01/21/2016 19:43:10  01/26/2016 19:43:10  krbtgt/REALM.A@REALM.A
        renew until 01/28/2016 19:43:10, Flags: FRIA
        Addresses: (none)

root@slurmdev1:~$ kvno host/slurmdev1.realm.a@REALM.A
host/slurmdev1.realm.a@REALM.A: kvno = 1
root@slurmdev1:~$

Thank you again.

hautreux commented 8 years ago

It seems that the ccache you see is your user's, not root's. Maybe it is due to the use of 'su' instead of 'su -'; you should consider using the latter and give me the results again. Slurmstepd is supposed to use root's ccache by default, so we will see which principal is used and whether the auksd TGS was properly acquired and stored in the ccache. The principal used will have to be matched by the auks ACL admin entry.
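
For example, a minimal sketch of the check being suggested (run on a compute node; the exact ccache path will differ):

su -           # full root login shell, so root's own ccache is picked up
klist -fna     # should now show root's ticket cache (typically /tmp/krb5cc_0), not the user's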

hautreux commented 8 years ago

You have to do that on the compute node(s), of course.

hautreux commented 8 years ago

Just for information, there is a PDF here that might be of interest: http://downloads.sourceforge.net/project/auks/publications/auks-turorial.pdf

sreedharmanchu commented 8 years ago

Hi,

Thank you. Now I did sudo su - and ran these commands on slurmdev1 and slurmdev2. Just to make sure I'm doing it right, I'll describe what I have right now. I have 3 servers: slurmdev1, slurmdev2 and slurmdev3. The first one is the front end that runs the Slurm scheduler and also acts as a login node. All three are compute nodes as well. I have auks.acl in /etc/auks/ only on slurmdev1; I don't have it on the other two nodes.

Could you please take a look at my auks.acl to see whether it is right for the above setup?

Thanks a million again.

Just saw the PDF link. I went through it before, but I'm going to go through it again.

slurmdev1:~# klist -fna
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/slurmdev1.x.realm.com@X.REALM.COM

Valid starting       Expires              Service principal
01/22/2016 09:30:37  01/22/2016 19:30:37  krbtgt/X.REALM.COM@X.REALM.COM
        renew until 01/29/2016 09:30:37, Flags: FRIA
        Addresses: (none)
01/22/2016 09:31:11  01/22/2016 19:30:37  host/slurmdev1.x.realm.com@X.REALM.COM
        renew until 01/29/2016 09:30:37, Flags: FRAT
        Addresses: (none)

slurmdev1:~# kvno host/slurmdev1.x.realm.com@X.REALM.COM
host/slurmdev1.x.realm.com@X.REALM.COM: kvno = 1

slurmdev1:~# ssh slurmdev2
Last login: Thu Jan 21 13:31:51 2016 from slurmdev1.x.realm.com
slurmdev2:~# whoami
root
slurmdev2:~# klist -fna
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/slurmdev2.x.realm.com@X.REALM.COM

Valid starting       Expires              Service principal
01/22/2016 13:44:55  01/22/2016 23:44:55  krbtgt/X.REALM.COM@X.REALM.COM
        renew until 01/29/2016 13:44:55, Flags: FRIA
        Addresses: (none)

slurmdev2:~# kvno host/slurmdev2.x.realm.com@X.REALM.COM
host/slurmdev2.x.realm.com@X.REALM.COM: kvno = 1
slurmdev2:~# kvno host/slurmdev1.x.realm.com@X.REALM.COM
host/slurmdev1.x.realm.com@X.REALM.COM: kvno = 1
slurmdev2:~# exit
logout
Connection to slurmdev2 closed.

slurmdev1:~# cat /etc/auks/auks.acl

#-------------------------------------------------------------------------------
rule { principal = ^host/slurmdev[1-3].x.realm.com@X.REALM.COM$ ; host = * ; role = admin ; }
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
rule { principal = ^host/slurmdev[1-3].x.realm.com@X.REALM.COM$ ; host = * ; role = user ; }
#-------------------------------------------------------------------------------

#-------------------------------------------------------------------------------
rule { principal = ^[[:alnum:]]*@X.REALM.COM$ ; host = * ; role = user ; }
#-------------------------------------------------------------------------------

slurmdev1:~#


hautreux commented 8 years ago

Everything seems fine, that is strange.

On slurmdev2, as root, do :

auks -vvvvvvv -g -u 564800185 -c /tmp/ccache

This is supposed to get your user's ticket into /tmp/ccache. It should fail, as it is (roughly) what slurmstepd is doing too. If the error is not explicit, increase the api debug and log levels in auks.conf, try again, and give me all the outputs.
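
For instance, a sketch of the kind of change meant here, in the api block of auks.conf on the compute node (the value 5 is just an assumed higher verbosity than the current 3; keep whatever LogFile/DebugFile you already use):

api {
    LogFile    = /tmp/auksapi.log ;
    LogLevel   = 5 ;
    DebugFile  = /tmp/auksapi.log ;
    DebugLevel = 5 ;
}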

sreedharmanchu commented 8 years ago

Surprisingly, it looks like it succeeded. One thing that looks strange is that I don't have a secondary daemon and it shows up as localhost. I don't know whether that causes any problem though.

slurmdev2:~# auks -vvvvvvv -g -u 564800185 -C /tmp/ccache
Fri Jan 22 15:35:44 2016 [INFO2] [euid=0,pid=61194] auks_engine: initializing engine from 'common' block of file /etc/auks/auks.conf
Fri Jan 22 15:35:44 2016 [INFO2] [euid=0,pid=61194] auks_engine: initializing engine from 'api' block of file /etc/auks/auks.conf
Fri Jan 22 15:35:44 2016 [INFO2] [euid=0,pid=61194] auks_engine: initializing engine from 'renewer' block of file /etc/auks/auks.conf
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine primary daemon is 'slurmdev1'
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine primary daemon address is 'slurmdev1'
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine primary daemon port is 12345
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine primary daemon principal is host/slurmdev1.x.realm.com@X.REALM.COM
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine secondary daemon is 'localhost'
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine secondary daemon address is 'localhost'
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine secondary daemon port is 12345
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine secondary daemon principal is
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine logfile is /var/log/auks/auksapi.log
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine loglevel is 3
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine debugfile is /var/log/auks/auksapi.log
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine debuglevel is 3
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine retry number is 3
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine timeout is 10
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine delay is 3
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine NAT traversal mode is disabled
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine renewer_logfile is /var/log/auks/auksdrenewer.log
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine renewer_loglevel is 3
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine renewer_debugfile is /var/log/auks/auksdrenewer.log
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine renewer_debuglevel is 3
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine renewer delay is 60
Fri Jan 22 15:35:44 2016 [INFO3] [euid=0,pid=61194] auks_engine: engine renewer min cred lifetime is 600
Auks API request succeed

slurmdev2:~# ls -l /tmp/ccache
-rw------- 1 root root 1.2K Jan 22 15:35 /tmp/ccache

hautreux commented 8 years ago

So auks works and is properly configured, but it does not work when used by slurmd/slurmstepd. Can you confirm by running a job on the same node?

Do you have SELinux enabled on your system? It might be the reason why Slurm is not able to use root's credential cache.

sreedharmanchu commented 8 years ago

I submitted a job and it does produce output, so it seems to work. But it also shows the same error I am getting with srun.

SELinux is also disabled. So it looks like I'm missing some configuration somewhere.

sreedhar@slurmdev2:~/jobs$ cat submit1.sh

#!/bin/bash
#
#SBATCH --job-name=test1
#SBATCH --output=res1.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 6
uptime

sreedhar@slurmdev2:~/jobs$ sbatch submit1.sh
Submitted batch job 127
sreedhar@slurmdev2:~/jobs$ cat res1.txt
Auks API request failed : krb5 cred : unable to read credential cache
Auks API request failed : krb5 cred : unable to read credential cache
slurmdev1.x.realm.com
Auks API request failed : krb5 cred : unable to read credential cache
15:48:19 up 79 days, 4:55, 5 users, load average: 0.00, 0.01, 0.05
sreedhar@slurmdev2:~/jobs$ sudo su -
[sudo] password for sreedhar:
Last login: Fri Jan 22 15:42:46 EST 2016 from slurmdev1.x.realm.com on pts/0
slurmdev2:~# sestatus
SELinux status: disabled
slurmdev2:~#
caslurmdev2:~#

Thanks, Sreedhar.

sreedharmanchu commented 8 years ago

I took srun out of the submit script and I got the same error with sbatch too.

sreedhar@slurmdev2:~/jobs$ cat submit1.sh

#!/bin/bash
#
#SBATCH --job-name=test1
#SBATCH --output=res1.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

hostname
sleep 6
uptime

sreedhar@slurmdev2:~/jobs$ cat res1.txt
Auks API request failed : krb5 cred : unable to read credential cache
slurmdev1.x.realm.com
15:58:48 up 79 days, 5:05, 5 users, load average: 0.00, 0.01, 0.05
sreedhar@slurmdev2:~/jobs$

hautreux commented 8 years ago

Take a look at the PDF, you may have missed something. Check your SPANK configuration.

Try to increase the auks Slurm plugin and auks API log/debug levels on the compute nodes and look at the outputs. You need to validate that slurmd is using /tmp/krb5cc_0 when trying to contact the auksd daemon.
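
One way to check that (a sketch of the kind of inspection used later in this thread; the slurmd pid will differ):

pgrep -x slurmd                                               # find the slurmd pid
tr '\0' '\n' < /proc/<slurmd_pid>/environ | grep KRB5CCNAME   # which ccache slurmd inherited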

hautreux commented 8 years ago

I think that 'Auks API request failed : krb5 cred : unable to read credential cache' is generated by slurmstepd when trying to retrieve the user's ticket. So slurmstepd has difficulties reading its ccache. Multiple possible reasons:

sreedharmanchu commented 8 years ago

You are a genius. Thank you so much. That was the issue. For whatever stupid reason, I made myself an admin when I added myself as a user with sacctmgr. So slurmstepd was doing it with my ccache.

slurmdev1:~# cat /proc/7427/environ | tr '\0' '\n' | grep KRB
KRB5CCNAME=FILE:/tmp/krb5cc_564800185_oX5Fgz

Just now I changed it and immediately it started working.
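
For reference, the revert presumably amounted to something along these lines (a sketch; the exact user name and target AdminLevel depend on the site):

sacctmgr modify user where name=sreedhar set AdminLevel=None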

Again, thank you a ton. I really appreciate you patiently dealing with me. It's the first time I'm dealing with both Slurm and Kerberos (I worked with Torque and Moab for 5 years).

I checked it on all the nodes and it's working perfectly.

Thank you very much again.

Best regards, Sreedhar.

hautreux commented 8 years ago

Great !

Enjoy your Slurm+auks setup :)