hautreux / auks

Kerberos credential support for batch environments

Rare issue causing completed SLURM jobs to remain in the queue #59

Closed: trenta closed this issue 2 years ago

trenta commented 3 years ago

This is an issue that happens maybe every other week. The result is that finished SLURM jobs remain in the queue with RUNNING status.

The relevant error messages I'm getting are:

error: spank-auks: Error while initializing a new unique
error: spank-auks: Unable to destroy ccache

Any idea what's happening here?

trenta commented 3 years ago

The problem doesn't seem to be that rare. It seems to occur when a large number of jobs are launched at once and only affects some of the jobs.
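One way to spot the affected jobs is to compare what the controller still reports as RUNNING with what the accounting database says. The commands below use standard Slurm tools; `<jobid>` is a placeholder.

```sh
# Jobs the controller still believes are running
squeue --states=RUNNING --format="%i %j %T %M"

# Cross-check a suspect job against accounting: a stuck job typically
# shows COMPLETED (or FAILED) here while squeue still reports RUNNING
sacct -j <jobid> --format=JobID,State,ExitCode,End
```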

hautreux commented 2 years ago

It could be interesting to increase the API verbosity in auks.conf and collect the traces when the errors occur.
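As a rough sketch of what that could look like, assuming the api block with LogLevel/DebugLevel parameters from the stock auks.conf example (paths and levels here are illustrative, not a recommended setting):

```
api {
    # API library log file and verbosity (higher = more verbose)
    LogFile    = "/var/log/auks/auksapi.log" ;
    LogLevel   = "3" ;

    # optional debug output, even more detailed
    DebugFile  = "/var/log/auks/auksapi.log" ;
    DebugLevel = "3" ;
}
```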

Commit 752645b2ddce8aa4e4e96f982dd6cf38d8b0e5ed fixes an issue in the ccache switching logic that might be one origin of the problem.

You could also consider disabling the switching logic to see if it helps (see 98093b1f3e9c764a8775ab9912c18b0de60b0175).

You could also revert to using a file-backed ccache (this should work again with auks >= 0.5.3, see 53347ab9a10246f2b9108039343887d1ae3adef8).
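If "file-backed ccache" here means the default cache type that libkrb5 resolves on the compute nodes (an assumption, not something stated above), that is typically selected via default_ccache_name in krb5.conf, for example:

```
# /etc/krb5.conf on the compute nodes -- illustrative only, assuming auks
# follows the libkrb5 default ccache type
[libdefaults]
    default_ccache_name = FILE:/tmp/krb5cc_%{uid}
```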

trenta commented 2 years ago

Really sorry, I thought I'd replied and closed this. The issue was that I'd left the defaults for secrets -> max_secrets, secrets/kcm -> max_uid_secrets, and kcm -> max_uid_ccaches in the sssd config. It took me a little while to find, but now that I've set all three to 0 (unlimited) I'm no longer seeing the issue. I haven't found it all that straightforward to get everything just right in sssd. I'm not sure whether all three settings are related, but for us there was no downside to setting them all to unlimited.
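Spelled out as an sssd.conf excerpt, those three settings look roughly like this (0 means unlimited; exact section placement can vary between SSSD versions):

```ini
[secrets]
max_secrets = 0

[secrets/kcm]
max_uid_secrets = 0

[kcm]
max_uid_ccaches = 0
```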

Thanks for responding.

Cheers

Trent