Closed by trenta 2 years ago
The problem doesn't seem to be that rare. It appears to occur when a large number of jobs are launched at once, and it only affects some of the jobs.
It would be worth increasing the API verbosity in auks.conf and collecting the traces when the errors occur.
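As a rough sketch, raising the log and debug levels might look like the following in auks.conf. The `api` block and the `loglevel`/`debuglevel` key names here are assumptions based on typical auks configurations, not taken from this thread; check the auks.conf shipped with your installation for the exact keys and defaults.

```
api {
  logfile    = /var/log/auks/auksapi.log ;
  loglevel   = 3 ;
  debugfile  = /var/log/auks/auksapi.log ;
  debuglevel = 3 ;
}
```

Once the errors reappear, the traces from the failing jobs can be attached here.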
The commit 752645b2ddce8aa4e4e96f982dd6cf38d8b0e5ed fixes an issue in the ccache switching logic that might be the origin of the problem.
You could also consider disabling the switching logic to see if it helps (see 98093b1f3e9c764a8775ab9912c18b0de60b0175).
You could also revert to using a file-backed ccache (should work again with auks >= 0.5.3; see 53347ab9a10246f2b9108039343887d1ae3adef8).
Really sorry. I thought I'd replied and closed this. The issue was that I'd left the defaults for secrets -> max_secrets, secrets/kcm -> max_uid_secrets, and kcm -> max_uid_ccaches in the sssd config. It took me a little while to find, but now I've set all three to 0 (unlimited) and am no longer seeing the issue. I haven't found it all that straightforward to get everything just right in sssd. I'm not sure whether all three settings were related, but for us there was no downside to setting them all to unlimited.
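For reference, the three settings described above would look roughly like this in sssd.conf. The section names are inferred from the secrets -> / kcm -> notation used here; verify them against the sssd-secrets and sssd-kcm man pages for your sssd version.

```ini
[secrets]
max_secrets = 0

[secrets/kcm]
max_uid_secrets = 0

[kcm]
max_uid_ccaches = 0
```

With these limits at 0 (unlimited), sssd-kcm should no longer refuse to create new credential caches when many jobs start at once.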
Thanks for responding.
Cheers
Trent
This is an issue that happens maybe every second week. The result is that finished SLURM jobs remain in the queue with RUNNING status.
The relevant error messages I'm getting are

error: spank-auks: Error while initializing a new unique

and

error: spank-auks: Unable to destroy ccache
Any idea what's happening here?