Open antonst opened 9 years ago
The error message for the worker is:
ssh_keysign: fork: Resource temporarily unavailable
key_sign failed
This results in 8 out of 384 units never reaching the 'done' state.
This happens both with 8 exec-workers and with a single exec-worker.
Andre, Mark, any idea what is happening here?
I would think you are running out of processes. If you don't mind, please run the following in the CU post exec: ps -ef --forest > ps.log. I assume you'll find the processes responsible in the last successful CU's ps.log. A quick guess would be that other pre- or post-process stuff is lingering? That would be the easy case. The bad case would be ssh or sh procs from the agent...
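To make the "running out of processes" guess easier to check, here is a minimal Python sketch that counts processes per command name and reads the per-user process limit. It is only an illustration, not part of radical.pilot: the helper names (count_by_command, nproc_limit, snapshot) are made up here, and it assumes a Linux host where ps accepts -o user= -o comm=.

```python
import resource
import subprocess
from collections import Counter


def count_by_command(ps_output: str) -> Counter:
    """Count processes per command name.

    Expects one process per line in the form "<user> <command>",
    as printed by `ps -e -o user= -o comm=` (assumed format).
    """
    counts = Counter()
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            counts[parts[1].strip()] += 1
    return counts


def nproc_limit():
    """Return the (soft, hard) per-user process limit (RLIMIT_NPROC)."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def snapshot() -> Counter:
    """Take a live snapshot; hypothetical helper, needs `ps` on PATH."""
    out = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_by_command(out)
```

Calling snapshot().most_common(5) in the post exec, next to nproc_limit(), would show whether lingering ssh or sh processes are approaching the limit that makes fork fail with "Resource temporarily unavailable".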
All CUs were executed on a single host because of an issue with the SHELL spawner which eventually led to process starvation. Fixed in https://github.com/radical-cybertools/radical.pilot/tree/fix/nested_ssh.
Note that depending on the scale, you will now run into SSH resource issues, but at least the initial problem is solved.
Based on my experience you don't hit this new limit if the CUs use at least 3 cores.
(This effectively limits the number of ssh connections to 8, which probably hints that the limit is 10 ssh connections, as 12 two-core CUs don't work.)
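The arithmetic behind that guess can be sketched as follows. The snippet assumes one ssh connection per concurrently running CU, and a pilot of 24 cores. The 24 is inferred from 8 x 3 = 12 x 2 and is not stated in this thread. The function name is made up for illustration.

```python
def concurrent_cus(total_cores: int, cores_per_cu: int) -> int:
    """Number of CUs that fit on the pilot at the same time."""
    if cores_per_cu <= 0:
        raise ValueError("cores_per_cu must be positive")
    return total_cores // cores_per_cu


TOTAL_CORES = 24  # assumption, inferred from the numbers quoted above

for cores in (2, 3):
    n = concurrent_cus(TOTAL_CORES, cores)
    print(f"{cores}-core CUs: {n} concurrent ssh connections")
```

Under these assumptions, 8 connections work and 12 do not, so the actual limit lies somewhere from 8 to 11; 10 is a plausible guess but is not confirmed here.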
Not touched since 2015 -> Backburner
With:
That particular run involved 384 replicas