TUU example with 1000 replicas fails on Gordon

radical-cybertools / radical.repex.at

This is the GitHub location for RepEx, developed by the RADICAL team in conjunction with the York Lab.

TUU example with 1000 replicas fails on Gordon #41

Closed antonst closed 8 years ago

antonst commented 8 years ago

with CU STDERR:

ssh_exchange_identification: Connection closed by remote host

antonst commented 8 years ago

Strangely enough, everything works fine for smaller runs. I would assume there are some nodes on Gordon for which SSH keys are not properly configured.

haoyuanchen commented 8 years ago

Do you mean that the job isn't even starting with 1000 replicas, or that it runs but fails halfway? If it fails halfway through, it could be that some replicas just didn't run for some reason. I saw this on Gordon too.

A quick check is to go to the job directory in the sandbox and do

tail -n 1 unit*/*mdout

A normally terminated Amber job should say something like:

| wallclock() was called 290025 times

If the last line is something else, or there is no mdout file at all, then that replica failed.
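The same check can be scripted when there are too many replicas to read by eye. The following is only a minimal sketch, assuming the sandbox layout implied by the command above (one `unit*` directory per replica, each holding a file whose name ends in `mdout`). The function names `last_nonempty_line` and `find_failed_units` are made up for this illustration and are not part of RepEx or Amber.

```python
from pathlib import Path

# Marker copied from the example mdout line quoted in this thread.
DONE_MARKER = "wallclock() was called"


def last_nonempty_line(path):
    """Return the last non-blank line of a text file, or '' if there is none."""
    text = path.read_text(errors="replace")
    lines = [line for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def find_failed_units(sandbox):
    """Return sorted names of unit* directories without a finished mdout.

    A unit counts as failed if it has no *mdout file at all, or if any of
    its *mdout files does not end with the completion marker.
    """
    failed = []
    for unit_dir in sorted(Path(sandbox).glob("unit*")):
        if not unit_dir.is_dir():
            continue
        mdouts = list(unit_dir.glob("*mdout"))
        finished = bool(mdouts) and all(
            DONE_MARKER in last_nonempty_line(m) for m in mdouts
        )
        if not finished:
            failed.append(unit_dir.name)
    return failed
```

Calling `find_failed_units(".")` from the job directory in the sandbox returns the list of unit directories that need a closer look. Unlike `tail -n 1`, the sketch skips trailing blank lines before checking the last line.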

marksantcroos commented 8 years ago

Given that this is also an SDSC site, it might be the same SSH rate limiting as we see on Comet.

antonst commented 8 years ago

> Do you mean that the job isn't even starting with 1000 replicas, or that it runs but fails halfway?

No, the job starts: some of the CUs finish the first Amber run, while others don't even start.

antonst commented 8 years ago

> Given that this is also an SDSC site, it might be the same SSH rate limiting as we see on Comet.

I would like to understand whether this is specific to my account on Gordon. Mark, do you have data from experiments on Gordon, with tasks other than /bin/sleep or /bin/date, that use more than 1000 cores concurrently?

marksantcroos commented 8 years ago

It's a per-node CU limit, not a total concurrent CU limit, afaik.

marksantcroos commented 8 years ago

But I don't have any such data on Gordon.

antonst commented 8 years ago

> It's a per-node CU limit, not a total concurrent CU limit, afaik.

So in theory if we use more cores per CU we can avoid this issue?

marksantcroos commented 8 years ago

> So in theory if we use more cores per CU we can avoid this issue?

Yes, at least on Comet I empirically verified that.

antonst commented 8 years ago

Great! Will give it a try.
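To illustrate why more cores per CU should help with a per-node limit: the number of CUs that can run on one node at the same time shrinks as each CU takes more cores. The sketch below is only the arithmetic behind that idea. The value of 16 cores per node is an assumption for illustration and should be replaced with the real node size; the actual SSH rate limit is not stated in this thread, so no threshold is modeled. `cus_per_node` is a hypothetical helper, not a RepEx or RADICAL-Pilot function.

```python
def cus_per_node(cores_per_node, cores_per_cu):
    """Return how many CUs fit on one node at the same time."""
    if cores_per_cu < 1 or cores_per_cu > cores_per_node:
        raise ValueError("cores_per_cu must be between 1 and cores_per_node")
    return cores_per_node // cores_per_cu


# Assumed node size, for illustration only.
CORES_PER_NODE = 16

for cores_per_cu in (1, 2, 4, 8, 16):
    count = cus_per_node(CORES_PER_NODE, cores_per_cu)
    print(f"{cores_per_cu:2d} cores per CU -> {count:2d} concurrent CUs per node")
```

Under that assumption, going from 1 to 4 cores per CU cuts the concurrent CUs per node from 16 to 4, which would reduce the number of CU launches hitting any single node at once.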