hautreux / auks

Kerberos credential support for batch environments

sbatch in loop is waiting on auks replies for 300 seconds #17

Closed sreedharmanchu closed 6 years ago

sreedharmanchu commented 8 years ago

Hi Matthew,

I came across this new issue just today. I am trying to submit a simple test script 50 times in a loop.

for i in {1..50};do sbatch test.sh;done

When the cluster is busy, around 40 calls to sbatch succeed immediately and the remaining ones just sit there for 300 seconds. At first I thought something was going on with Slurm. I checked netstat and realized all the sbatch calls were waiting for replies from auks (all trying to talk to port 12345 on the auks server).

Then I restarted the auks daemon and they finished immediately. If I don't restart it, they sit there for exactly 5 minutes before they finish.
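To see which submissions stall, each call in the loop can be timed individually. The following is only a rough sketch: `time_runs` is a hypothetical helper name, not part of Slurm or auks, and it assumes a `date` that supports `+%s`.

```shell
# Run a command COUNT times and print the exit status and the elapsed
# wall-clock time (whole seconds) of each run.
# Usage: time_runs COUNT COMMAND [ARGS...]
# time_runs is a hypothetical helper, not a Slurm or auks tool.
time_runs() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1        # discard the command's own output
        status=$?
        end=$(date +%s)
        echo "run $i: exit=$status elapsed=$((end - start))s"
        i=$((i + 1))
    done
}
```

Something like `time_runs 50 sbatch test.sh` would then print one line per submission, so the calls that wait for the full 300 seconds should stand out.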

Now that the cluster is a bit freer, I increased the count to 100 and then 500, and most of the time they all finish. This makes me think that I need to increase the number of threads. Right now I have 1000 workers, a queue size of 500, a repo size of 1000, a clean delay of 300, and the reply cache set to no.

What do you recommend? Increasing workers or some other value? I am not sure whether these 300 seconds have anything to do with the clean delay.

I am just not sure how many threads/workers I can allocate for auks. We definitely submit many jobs in a short amount of time.

If you have any recommendation please let me know. If you think it has nothing to do with workers at all, then please let me know if you have any thoughts on how I can fix this if there is another configuration variable I need to adjust.

Thanks, Sreedhar.

sreedharmanchu commented 8 years ago

Hi,

Just following up on this. I found out that the auksd process was hitting the soft file descriptor limit. I increased the limit and all the problems we had went away.

First, I tried adding it to /etc/sysconfig/auks, hoping auksd would pick it up, but that didn't work. So I ended up doing this, and it worked.

On the server where auksd is running:

cat /etc/systemd/system/auksd.service.d/filelimit.conf
[Service]
LimitNOFILE=4096

I had to create the auksd.service.d directory before putting this file in place.
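For anyone repeating this, the two steps (create the directory, write the file) can be sketched as a small shell function. This is only an illustration: `write_nofile_dropin` and its optional ROOT argument are hypothetical names added here so the function can be tried in a scratch directory first. The post does not mention it, but systemd normally needs `systemctl daemon-reload` before it picks up a new drop-in.

```shell
# Write a systemd drop-in that sets LimitNOFILE for a unit.
# Usage: write_nofile_dropin UNIT LIMIT [ROOT]
# ROOT is a hypothetical prefix for dry runs; leave it empty to write
# under the real /etc (which requires root).
write_nofile_dropin() {
    unit=$1
    limit=$2
    root=${3:-}
    dir="$root/etc/systemd/system/$unit.service.d"
    mkdir -p "$dir" || return 1     # the .service.d directory must exist first
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/filelimit.conf"
}

# On the real server, as root, one would then typically run:
#   write_nofile_dropin auksd 4096
#   systemctl daemon-reload
#   systemctl restart auksd
```

Running `write_nofile_dropin auksd 4096 /tmp/dryrun` first makes it possible to inspect the generated file before touching /etc.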

The drop-in file now also shows up in the output of systemctl status auksd. The limits that are actually in effect can be checked in /proc//limits

Thanks, Sreedhar.
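As a rough sketch of that check, the functions below compare the number of descriptors a process has open with its soft limit. They assume a Linux procfs where the per-process files live under the process ID (limits and fd); the function names are hypothetical.

```shell
# Print the soft "Max open files" limit of a process, read from procfs.
# $1 is the process ID. In that line the soft limit is the 4th field.
fd_soft_limit() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}

# Count the file descriptors the process currently has open.
fd_open_count() {
    ls "/proc/$1/fd" | wc -l
}

# Print both values on one line for a given PID.
fd_report() {
    echo "pid=$1 open=$(fd_open_count "$1") soft_limit=$(fd_soft_limit "$1")"
}
```

For example, `fd_report "$(pidof auksd)"` run as root on the auks server should show how close the daemon is to its limit, assuming a single auksd process.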

hautreux commented 6 years ago

Thanks for sharing your feedback and tuning.