open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.17k stars 860 forks source link

Fix OSC RDMA to work in cases where one BTL cannot reach all communicator group members #7830

Closed jsquyres closed 2 years ago

jsquyres commented 4 years ago

Cisco MTT shows that MPI_WIN_CREATE is failing since the OSC pt2pt component was removed (i.e., MPI_WIN_CREATE fails because there is no OSC component available). Here's an example of a simple test failing because MPI_WIN_CREATE failed.

After discussion on the 16 June 2020 weekly Open MPI webex, @hjelmn said that he could add a few functions to make OSC RDMA work in all cases (not just in cases where a single BTL can reach all members in the communicator group).

Marking this as a blocker for v5.0 because it's preventing MPI_WIN_CREATE from working in at least some environments, and because -- per the discussion between @hjelmn and @bwbarrett -- the fix to OSC RDMA doesn't seem too difficult.

hjelmn commented 4 years ago

I have the implementation mostly complete. It will require reworking how BTL receive completions are handled but that is a good thing. Been looking at reworking those for years. I have a branch with the changes but I can not test usnic or portals4. I verified vader, tcp, and self. Will validate uGNI and UCX today.

jsquyres commented 4 years ago

Send me the branch and what tests to run, and I'll check usnic.

hjelmn commented 4 years ago

https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome

This only has the changes to the BTL receive logic. Please run a standard set of two-sided tests. IMB would work. Feel free to push back and changes btl/usnic needs.

The old receive callback got btl, tag, descriptor, cbdata. The new callback gets the btl and a new receive descriptor which contains the tag, endpoint, received message, and cbdata.

hjelmn commented 4 years ago

Really curious if there are any performance regressions. There shouldn't be but that is going to be where any pushback will be.

hjelmn commented 4 years ago

@jsquyres Ping. Please take a look at the branch mentioned above. I need to know if this change works for you as I can not verify the usnic BTL changes. I want to get this change approved and in this week so I can get the btl base atomics support finished.

jsquyres commented 4 years ago

@hjelmn Sorry for the delay. I ran the https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome branch last night and things generally look good. I'm getting some random failures, but I'm apparently getting some random failures with master, too. Specifically: with the btl_base_atomics_are_awesome branch, I'm seeing most (all?) of the one-sided tests that were failing are now passing.

We'll have to address the other random failures in another issue.

jsquyres commented 4 years ago

This is still causing terabytes of corefiles with Cisco's MTT that I regularly have to manually clear out to avoid filling my filesystem.

I'm going to have to disable Cisco MTT's one-sided testing until this is fixed.

jsquyres commented 4 years ago

I don't know if this is related to #7892 or not, but I figured I'd put in the cross-reference.

gpaulsen commented 3 years ago

Austen said he has some time next week to take a look. Thanks!

awlauria commented 3 years ago

osc/rdma will internally call MPI_Comm_split(MPI_COMM_TYPE_NODE) to create a local group, and is getting some interesting/weird results. For example, I'm seeing the comm size on return be 3 when running 4 processes across 2 nodes, with 2 procs per node.

I have to think that https://github.com/open-mpi/ompi/issues/9010 is related.

gpaulsen commented 3 years ago

@jsquyres @awlauria #9010 has been closed (some PMIx/PRRTE change, not root caused), so perhaps this issue also just needs to be retested?

awlauria commented 3 years ago

It helps but I still run into issues.

Unfortunately I haven't had time to press on it much since the last update, and am out next week. Earliest I can give cycles on it would be the 28th.

gpaulsen commented 3 years ago

oops, I accidentally clicked close, then re-opened it immediately.

jsquyres commented 3 years ago

FYI @bosilca

hjelmn commented 3 years ago

I plan to take a look this week.

gpaulsen commented 3 years ago

@hjelmn Have you been able to reproduce?

gpaulsen commented 3 years ago

@bosilca Did you give this a try? This may just be IBM setup? If so this wouldn't be a blocker.

bosilca commented 3 years ago

Bad news. I run the onesided tests on one of our clusters with the current master (gf882930). They pass with the UCX OSC, and with RDMA OSC as long as only the sm BTL is in use. As soon as I spread the rank across multiple nodes all one-sided tests fail (basic two-sided test pass so my BTL selection is legit).

hjelmn commented 3 years ago

Same issue as described here? Where the comm split is wrong?

bosilca commented 3 years ago

The communicator seems to be correct. I attached to one of the processes before it segfault, and the issue is in osc_rdma_lock.h, function ompi_osc_rdma_btl_fop, where the state_index is -1. The stack and peer variable do not look corrupted or uninitialized but both btl_index (state and data) of the peer object are set to 255 (because unsigned) and the selected_btl is then some memory location outside of the allocated array of BTLs.

wzamazon commented 3 years ago

@bosilca Can you elaborate on the error you are seeing? what type of btl? what test? This may be related to https://github.com/open-mpi/ompi/issues/8983.

bosilca commented 3 years ago

Any test from the onesided directory in ompi-tests. BTL: tcp and sm. As I said above I see either segfaults or deadlocks.

rhc54 commented 3 years ago

FWIW: the Cisco MTT now only shows segfaults from onesided tests and from the datatype "pack" function. It is otherwise running pretty well. Strictly tcp and sm there.

bosilca commented 3 years ago

I applied the patch from #9400 and rerun the tests. Similar story as before, segfaults and deadlocks. To give a precise example, here is how I run it:

$ salloc -N 2 -wa,b
$ mpirun -np 4 --map-by node --mca osc rdma --mca btl tcp,self -ca btl_tcp_if_include ib0 ./test_dan1

The test segfault in lock_unlock stage. However, the same test segfault when using UCX as well.

$ mpirun -np 4 --map-by node --mca pml ucx ./test_dan1
wzamazon commented 3 years ago

@bosilca I can reproduce the segfault when running test_dan1, I will look into it.

wzamazon commented 3 years ago

@bosilca The segfault from the lock_unlock stage in test_dan1 is because of locking type being two level. When locking type is two level, a process will first lock a global leader then lock the peer. The segfault happens when global leader is the process itself, in that case, because btl/tcp cannot do self communication, btl_index is 255, and selected_btl is NULL. adding --mca osc_rdma_locking_mode on_demand fixed this segfault. If I understand correctly, two level locking is an optimization, not a mandate, so we should just disable two level locking when btl/tcp is used.

However, even with on_demand locking, I encountered numerous issues like hang, segfault, and data corruption when running other tests under ompi/onesided.

I will make a list and keep looking into it.

hjelmn commented 3 years ago

Honestly let's just change the default to on demand locking in this case. The performance is not bad with on demand.

wzamazon commented 3 years ago

I just updated https://github.com/open-mpi/ompi/pull/9400 to address several new issues I found

The biggest issue continue to be that btl/tcp does not support self communication. To address this issue, I added https://github.com/open-mpi/ompi/pull/9400/commits/b0197d91d828401ba0b7b8593e026f2215cb70eb which uses btl/self for self communication (if selected btl does not support self communication).

To make it work, I had to do some refactor work , and make some change to btl/self, and change to osc/rdma.

Then there is another bug in osc/rdma in that the shm file name is not unique. I added https://github.com/open-mpi/ompi/pull/9400/commits/3c0ae03e48dd543ca941ac29e26953e4c89b2158 to address it.

With the current PR, most of the test under ompi-tests/onesided directory pass. except:

One negative test in t_winerror failed. The test was trying to free a non-existent window, and expect MPI_Win_free() to return MPI_WIN_ERROR, however, the error is considered fatal, so MPI aborted. I am not sure how (or should) we can fix it. Similar issue with test_start1.

lock4, get2 randomly hang on two nodes, I will continue investigate

put1, put4 data validation error on two nodes. I will continue investigate.

hjelmn commented 3 years ago

Catching up. Can btl/sm not be used? It allows self communication. When using tcp it should always also use btl/sm.

hjelmn commented 3 years ago

But I could be remembering incorrectly :).

wzamazon commented 3 years ago

Can btl/sm not be used?

Problem is people are running tests with the sm btl excluded (by specifying --mca btl self,tcp).

I do not have the full background. I wonder if there is an agreement that btl/tcp must be used with btl/sm (for osc/rdma)?

gpaulsen commented 2 years ago

@bosilca You're assigned to this, but are you working this?

wzamazon commented 2 years ago

https://github.com/open-mpi/ompi/pull/9696 will fix this issue too.

awlauria commented 2 years ago

@jsquyres can we close this?

awlauria commented 2 years ago

@jsquyres ping. can this be closed?

awlauria commented 2 years ago

I think this is done.. @jsquyres if this is still an issue please re-open.