open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

SSH launch fails when host file has more than 64 hosts #6198

Open · bwbarrett opened this issue 5 years ago

bwbarrett commented 5 years ago

We're seeing launch failures when the host file has more than 64 hosts; the failures go away when the --mca routed direct MCA parameter is added. The platform was x86_64 Linux on EC2, and each instance has 2 cores (4 hyperthreads). The hostfile looked like the following (an example of the failing launch and the workaround is sketched after the list):

172.31.16.122
172.31.16.222
172.31.16.67
172.31.16.80
172.31.17.114
172.31.17.135
172.31.17.173
172.31.17.178
172.31.17.181
172.31.17.235
172.31.17.244
172.31.17.254
172.31.17.26
172.31.17.7
172.31.18.106
172.31.18.143
172.31.18.187
172.31.18.28
172.31.18.36
172.31.18.82
172.31.19.153
172.31.19.31
172.31.19.64
172.31.19.99
172.31.20.109
172.31.20.139
172.31.20.45
172.31.20.48
172.31.20.54
172.31.20.92
172.31.21.198
172.31.21.247
172.31.21.35
172.31.21.49
172.31.22.105
172.31.22.187
172.31.22.233
172.31.22.96
172.31.22.97
172.31.23.139
172.31.23.15
172.31.23.17
172.31.23.176
172.31.23.18
172.31.23.197
172.31.23.226
172.31.23.240
172.31.23.46
172.31.23.59
172.31.24.106
172.31.24.125
172.31.24.134
172.31.24.153
172.31.24.159
172.31.24.190
172.31.24.59
172.31.24.64
172.31.25.105
172.31.25.147
172.31.25.204
172.31.25.205
172.31.25.74
172.31.26.126
172.31.26.146
172.31.26.232
172.31.26.254
172.31.26.65
172.31.26.69
172.31.27.129
172.31.27.148
172.31.27.184
172.31.27.198
172.31.27.234
172.31.27.28
172.31.27.35
172.31.28.13
172.31.28.22
172.31.28.221
172.31.28.30
172.31.28.38
172.31.28.75
172.31.29.20
172.31.29.232
172.31.29.40
172.31.29.42
172.31.29.46
172.31.29.63
172.31.29.78
172.31.30.21
172.31.30.245
172.31.30.31
172.31.30.48
172.31.30.82
172.31.31.126
172.31.31.159
172.31.31.82
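
For illustration, a minimal sketch of the failing launch and the workaround; the application name, process count, and hostfile name are placeholders rather than the exact command we ran:

```sh
# Launch across all hosts in the hostfile; with more than 64 hosts this
# fails on the affected releases (./my_app, -np 100, hosts.txt are placeholders).
mpirun --hostfile hosts.txt -np 100 ./my_app

# Workaround: bypass the routed tree so every daemon talks directly to mpirun.
mpirun --mca routed direct --hostfile hosts.txt -np 100 ./my_app
```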
rhc54 commented 5 years ago

Something has borked the routed setup. The default radix is 64, so every daemon is a direct child of mpirun until you pass 64 hosts, which is exactly where this starts failing. Either we aren't computing the routes or the table is wrong.
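
To check what a given install actually defaults to, you can dump the routed framework's MCA parameters with ompi_info; a sketch (the --level flag may not be needed on older releases):

```sh
# List the routed framework's MCA parameters, including the default radix (64).
ompi_info --param routed all --level 9
```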

rhc54 commented 5 years ago

BTW: the easiest way to test with only a couple of nodes is to add --mca routed_radix 1 to your command line; this basically creates a linear "tree".
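
Roughly something like the following (hostfile and application are placeholders), which forces multi-level routing even on a handful of nodes:

```sh
# With routed_radix 1 each daemon gets at most one child, so even a tiny run
# exercises the relayed-routing path that normally needs more than 64 hosts.
mpirun --mca routed_radix 1 --hostfile hosts.txt -np 4 ./my_app
```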

bwbarrett commented 5 years ago

Yeah :(. "Humorously", I still have your email from 12/19/17 with instructions on configuring routed so we catch these issues in MTT / CI. Guess I should have acted on that.

rhc54 commented 5 years ago

Any progress on this? I should think it a blocker for the branches.

gpaulsen commented 5 years ago

@rhc54 mentioned on the call today that he may have a fix for this.

mkre commented 5 years ago

We're also running into this problem on AWS, but luckily there is an easy workaround (--mca routed direct). I've got two questions, though:

  1. We don't see this issue on an InfiniBand cluster. Is it possible that this only affects Ethernet clusters?
  2. Are there any negative consequences of the workaround (such as higher startup times)? We are trying to figure out under which circumstances we should use it.

jsquyres commented 5 years ago

@rhc54 is pretty sure that he fixed this on master.

@mkre @bwbarrett @dfaraj can you try a nightly snapshot from master and see if the problem is resolved? See https://www.open-mpi.org/nightly/master/

According to https://github.com/open-mpi/ompi/issues/6786#issuecomment-507918910, it looks like it is still broken on the v4.0 branch as of 29 June 2019. If it is, indeed, fixed on master, @rhc54 graciously said he'd try to track down a list of commits that fixed the issue for us so that we can port them to the v4.0.x branch.
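
When testing a snapshot, it's worth confirming which build is actually being picked up before re-running the reproducer; for example:

```sh
# Verify which Open MPI build is on the PATH before re-running the reproducer.
mpirun --version
ompi_info --version
```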

mkre commented 5 years ago

@jsquyres, we'll test this and report back, but it may take us a couple of days.

mkre commented 5 years ago

Does anyone have an idea under which circumstances this issue appears? As I said, so far we haven't seen it on one of our InfiniBand clusters, only on AWS. Could it be that Open MPI takes a different code path on those systems, or have we just been lucky with the IB system?

mkre commented 5 years ago

> @mkre @bwbarrett @dfaraj can you try a nightly snapshot from master and see if the problem is resolved? See https://www.open-mpi.org/nightly/master/

@jsquyres, we have tested this and I can confirm that the hang is resolved with the nightly snapshot.

jjhursey commented 5 years ago

I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see Issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might.
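
As a related sanity check (not something the fix requires), tree spawn can also be disabled outright to confirm it is the piece that is implicated; assuming the plm_rsh_no_tree_spawn boolean exposed by the rsh PLM (worth verifying with ompi_info on your build), a sketch:

```sh
# Have mpirun launch every remote daemon itself instead of fanning out through
# intermediate daemons; if the hang disappears, tree spawn is implicated.
mpirun --mca plm_rsh_no_tree_spawn 1 --hostfile hosts.txt -np 100 ./my_app
```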

mkre commented 4 years ago

@jjhursey, sorry for the late answer. I can confirm that the issue is fixed in Open MPI 4.0.2, but still persists on 3.1.5.

jjhursey commented 4 years ago

Looks like the 3.1.5 issue is reported in Issue #7087 as well.

gpaulsen commented 4 years ago

Removing Target: Master and Target: v4.0.x labels, as this issue is now fixed in those branches.

gpaulsen commented 4 years ago

FYI @mwheinz may also be interested in this fix on v3.1.x

mwheinz commented 4 years ago

Do we know what change fixed this in the 4.0.x branch? If we knew that I could try to back-port it myself...

rhc54 commented 4 years ago

> PR #6944 fixes the issue for the v4.0.x branch.

As stated above.