bwbarrett opened this issue 5 years ago (status: Open)
Something has borked the routed setup, as the default radix is 64 - with 64 or fewer hosts every daemon is a direct child of mpirun, so the tree routing only kicks in beyond 64 hosts, which matches the failure threshold. Either we aren't computing the routes or the table is wrong.
BTW: the easiest way to test with only a couple of nodes is to add --mca routed_radix 1 to your cmd line - this basically creates a linear "tree".
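For example, a minimal sketch of such a test run (the node names and binary are placeholders):

```sh
# Force a radix of 1 so the routing "tree" degenerates into a line,
# exercising the tree-spawn code paths with just two nodes.
# node1, node2, and ./a.out are placeholders.
mpirun --mca routed_radix 1 --host node1,node2 -np 2 ./a.out
```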
Yeah :(. "Humorously", I still have your email from 12/19/17 with instructions on configuring routed so we catch these issues in MTT / CI. Guess I should have acted on that.
Any progress on this? I should think it a blocker for the branches.
@rhc54 mentioned on the call today, that he may have a fix for this.
We're also running into this problem on AWS, but luckily there is an easy workaround (-mca routed direct). I've got two questions, though:
@rhc54 is pretty sure that he fixed this on master.
@mkre @bwbarrett @dfaraj can you try a nightly snapshot from master and see if the problem is resolved? See https://www.open-mpi.org/nightly/master/
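If it helps, here is a minimal build-and-test sketch (the tarball name is a placeholder - nightly filenames are date/hash stamped, so grab the current one from that page):

```sh
# Download, build, and install a nightly master snapshot into $HOME/ompi-nightly.
# The tarball name below is a placeholder; use the current one from the nightly page.
wget https://www.open-mpi.org/nightly/master/openmpi-master-YYYYMMDD.tar.bz2
tar xf openmpi-master-YYYYMMDD.tar.bz2
cd openmpi-master-YYYYMMDD
./configure --prefix=$HOME/ompi-nightly
make -j 8 && make install
# Retry the failing >64-host launch with the freshly built mpirun
# (the hostfile and process count are placeholders):
$HOME/ompi-nightly/bin/mpirun --hostfile hostfile -np 130 hostname
```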
According to https://github.com/open-mpi/ompi/issues/6786#issuecomment-507918910, it looks like it is still broken on the v4.0 branch as of 29 June 2019. If it is, indeed, fixed on master, @rhc54 graciously said he'd try to track down a list of commits that fixed the issue for us so that we can port them to the v4.0.x branch.
@jsquyres, we'll test this and report back, but it may take us a couple of days.
Does anyone have an idea under which circumstances this issue appears? As I said, so far we couldn't see this issue on one of our InfiniBand clusters, but only on AWS. Could it be the case that Open MPI takes a different code path on those systems, or are we just lucky with the IB system?
@jsquyres, we have tested this and I can confirm that the hang is resolved with the nightly snapshot.
I posted a fix to the plm/rsh component that resolves a mismatch between the tree spawn and the remote routed component (see Issue #6618 for details). PR #6944 fixes the issue for the v4.0.x branch. Can you give that a try to see if it resolves this issue? I think it might.
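When trying it, it may help to confirm which routed component actually gets selected on each node. A sketch using the standard MCA introspection and verbosity knobs (parameter names follow the usual MCA conventions - double-check them against your build):

```sh
# List the routed framework's components and parameters (including the
# radix component's default fan-out) for the build under test:
ompi_info --param routed all --level 9
# Turn up routed framework verbosity during launch to watch the selection
# (node1, node2 are placeholders):
mpirun --mca routed_base_verbose 100 --host node1,node2 -np 2 hostname
```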
@jjhursey, sorry for the late answer. I can confirm that the issue is fixed in Open MPI 4.0.2, but still persists on 3.1.5.
Looks like the 3.1.5 issue is reported in Issue #7087 as well.
Removing Target: Master and Target: v4.0.x labels, as this issue is now fixed in those branches.
FYI @mwheinz may also be interested in this fix on v3.1.x
Do we know what change fixed this in the 4.0.x branch? If we knew that I could try to back-port it myself...
PR #6944 fixes the issue for the v4.0.x branch, as stated above.
We're seeing launch failures when the host file has more than 64 hosts, which is resolved with the --mca routed direct MCA parameter. Platform was x86_64 Linux in EC2. Each instance has 2 cores (4 hyperthreads). Hostfile looked like:
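A hypothetical sketch of such a hostfile and the launch commands (hostnames, process counts, and the binary are placeholders, not the original data):

```sh
# Hypothetical hostfile sketch: one EC2 instance per line, slots matching
# the 2 physical cores per instance.
#   ip-10-0-0-1 slots=2
#   ip-10-0-0-2 slots=2
#   ...more than 64 such lines...
#   ip-10-0-0-65 slots=2
# The failing launch, and the same launch with the workaround flag:
mpirun --hostfile hostfile -np 130 ./a.out                      # fails at launch
mpirun --hostfile hostfile --mca routed direct -np 130 ./a.out  # works
```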