radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

T-REMD use-case fails on comet #38

Open antonst opened 9 years ago

antonst commented 9 years ago

The run fails with:

//home/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016689.0007-pilot.0000/pilot.0000-ExecWorker-0/30083.0/cmd: fork: retry: Resource temporarily unavailable

That particular run involved 384 replicas.

antonst commented 8 years ago

The error message for the worker is:

ssh_keysign: fork: Resource temporarily unavailable
key_sign failed

As a result, 8 out of 384 units never reach the 'done' state.

antonst commented 8 years ago

This happens with both 8 exec-workers and a single exec-worker.

antonst commented 8 years ago

Andre, Mark, any idea what is happening here?

andre-merzky commented 8 years ago

I would think you are running out of processes. If you don't mind, please run the following in the CU post-exec: ps -ef --forest > ps.log. I assume you'll find the responsible processes in the ps.log of the last successful CU. A quick guess would be that other pre- or post-processing steps are lingering; that would be the easy case. The bad case would be ssh or sh processes from the agent...
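To make the suggested ps.log easier to read at scale, here is a minimal Python sketch that tallies processes per executable from `ps -ef --forest` output. The function name `count_commands` and the sample log are hypothetical and made up for illustration; they are not taken from the failing run, and the parser assumes the standard eight-column `ps -ef` layout.

```python
import os
from collections import Counter

# Hypothetical sample of `ps -ef --forest` output (NOT from the actual run).
SAMPLE_PS_LOG = r"""
UID        PID  PPID  C STIME TTY          TIME CMD
antontre   100     1  0 10:00 ?        00:00:00 /bin/sh agent.sh
antontre   101   100  0 10:00 ?        00:00:00  \_ /usr/bin/ssh node1 cmd
antontre   102   100  0 10:00 ?        00:00:00  \_ /usr/bin/ssh node1 cmd
antontre   103   100  0 10:00 ?        00:00:00  \_ /usr/bin/ssh node1 cmd
antontre   104   100  0 10:00 ?        00:00:00  \_ /bin/sh -c post
"""


def count_commands(ps_log_text):
    """Count processes per executable name in `ps -ef --forest` output."""
    counts = Counter()
    for line in ps_log_text.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip blank lines and the header row
        # Drop the tree-drawing prefix that --forest adds ("|", "\_", spaces).
        cmd = fields[7].lstrip("|\\_ ").strip()
        if not cmd:
            continue
        executable = os.path.basename(cmd.split()[0]).rstrip(":")
        counts[executable] += 1
    return counts


if __name__ == "__main__":
    for name, n in count_commands(SAMPLE_PS_LOG).most_common():
        print(name, n)
```

An unexpectedly large count for ssh or sh in the last successful CU's ps.log would point at the agent side; a large count for application helpers would point at lingering pre- or post-exec steps.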

marksantcroos commented 8 years ago

All CUs were executed on a single host because of an issue with the SHELL spawner which eventually led to process starvation. Fixed in https://github.com/radical-cybertools/radical.pilot/tree/fix/nested_ssh.

marksantcroos commented 8 years ago

Note that depending on the scale, you will now run into SSH resource issues, but at least the initial problem is solved.

marksantcroos commented 8 years ago

In my experience, you don't hit this new limit if the CUs use at least 3 cores.

(This effectively limits the number of ssh connections to 8, which probably hints that the limit is 10 ssh connections, as 12 two-core CUs don't work.)
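The arithmetic behind that estimate can be sketched as follows. This assumes 24 cores per node (an assumption about Comet's compute nodes, not stated in this thread) and one ssh connection per concurrently running CU; the function name is hypothetical.

```python
ASSUMED_CORES_PER_NODE = 24  # assumption: core count of one compute node


def concurrent_cus_per_node(cores_per_node, cores_per_cu):
    """Number of CUs (and, by assumption, ssh connections) that fit on one node."""
    if cores_per_node < 1 or cores_per_cu < 1:
        raise ValueError("core counts must be positive")
    return cores_per_node // cores_per_cu


if __name__ == "__main__":
    for cores_per_cu in (1, 2, 3, 4):
        n = concurrent_cus_per_node(ASSUMED_CORES_PER_NODE, cores_per_cu)
        print(f"{cores_per_cu} core(s) per CU: {n} concurrent ssh connections")
```

Under these assumptions, 3-core CUs give 8 connections (works) and 2-core CUs give 12 (fails), which brackets the guessed limit of 10.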

ibethune commented 7 years ago

Not touched since 2015 -> Backburner