Closed jsquyres closed 2 years ago
I have the implementation mostly complete. It will require reworking how BTL receive completions are handled, but that is a good thing; I have been meaning to rework those for years. I have a branch with the changes, but I cannot test usnic or portals4. I verified vader, tcp, and self. Will validate uGNI and UCX today.
Send me the branch and what tests to run, and I'll check usnic.
https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome
This only has the changes to the BTL receive logic. Please run a standard set of two-sided tests; IMB would work. Feel free to push back on any changes btl/usnic needs.
The old receive callback got btl, tag, descriptor, cbdata. The new callback gets the btl and a new receive descriptor which contains the tag, endpoint, received message, and cbdata.
Really curious if there are any performance regressions. There shouldn't be but that is going to be where any pushback will be.
@jsquyres Ping. Please take a look at the branch mentioned above. I need to know if this change works for you, as I cannot verify the usnic BTL changes. I want to get this change approved and merged this week so I can finish the btl base atomics support.
@hjelmn Sorry for the delay. I ran the https://github.com/hjelmn/ompi/tree/btl_base_atomics_are_awesome branch last night and things generally look good. I'm getting some random failures, but I'm apparently getting some random failures with master, too. Specifically: with the btl_base_atomics_are_awesome branch, I'm seeing most (all?) of the one-sided tests that were failing are now passing.
We'll have to address the other random failures in another issue.
This is still causing terabytes of corefiles with Cisco's MTT that I regularly have to manually clear out to avoid filling my filesystem.
I'm going to have to disable Cisco MTT's one-sided testing until this is fixed.
I don't know if this is related to #7892 or not, but I figured I'd put in the cross-reference.
Austen said he has some time next week to take a look. Thanks!
osc/rdma will internally call MPI_Comm_split_type(MPI_COMM_TYPE_NODE) to create a local group, and is getting some interesting/weird results. For example, I'm seeing the comm size on return be 3 when running 4 processes across 2 nodes, with 2 procs per node.
I have to think that https://github.com/open-mpi/ompi/issues/9010 is related.
@jsquyres @awlauria #9010 has been closed (some PMIx/PRRTE change, not root caused), so perhaps this issue also just needs to be retested?
It helps but I still run into issues.
Unfortunately I haven't had time to press on it much since the last update, and am out next week. Earliest I can give cycles on it would be the 28th.
oops, I accidentally clicked close, then re-opened it immediately.
FYI @bosilca
I plan to take a look this week.
@hjelmn Have you been able to reproduce?
@bosilca Did you give this a try? This may just be an IBM setup issue? If so, this wouldn't be a blocker.
Bad news. I ran the onesided tests on one of our clusters with the current master (gf882930). They pass with the UCX OSC, and with the RDMA OSC as long as only the sm BTL is in use. As soon as I spread the ranks across multiple nodes, all one-sided tests fail (basic two-sided tests pass, so my BTL selection is legit).
Same issue as described here? Where the comm split is wrong?
The communicator seems to be correct. I attached to one of the processes before it segfaulted, and the issue is in osc_rdma_lock.h, function ompi_osc_rdma_btl_fop, where the state_index is -1. The stack and peer variables do not look corrupted or uninitialized, but both btl_index fields (state and data) of the peer object read as 255 (the -1 wraps around because the field is unsigned), and the selected_btl then points to some memory location outside the allocated array of BTLs.
@bosilca Can you elaborate on the error you are seeing? what type of btl? what test? This may be related to https://github.com/open-mpi/ompi/issues/8983.
Any test from the onesided directory in ompi-tests. BTLs: tcp and sm. As I said above, I see either segfaults or deadlocks.
FWIW: the Cisco MTT now only shows segfaults from onesided tests and from the datatype "pack" function. It is otherwise running pretty well. Strictly tcp and sm there.
I applied the patch from #9400 and re-ran the tests. Similar story as before: segfaults and deadlocks. To give a precise example, here is how I run it:
$ salloc -N 2 -w a,b
$ mpirun -np 4 --map-by node --mca osc rdma --mca btl tcp,self --mca btl_tcp_if_include ib0 ./test_dan1
The test segfaults in the lock_unlock stage. However, the same test segfaults when using UCX as well.
$ mpirun -np 4 --map-by node --mca pml ucx ./test_dan1
@bosilca I can reproduce the segfault when running test_dan1; I will look into it.
@bosilca The segfault from the lock_unlock stage in test_dan1 is because the locking type is two-level. With two-level locking, a process first locks a global leader and then locks the peer. The segfault happens when the global leader is the process itself: because btl/tcp cannot do self communication, btl_index is 255 and selected_btl is NULL. Adding --mca osc_rdma_locking_mode on_demand fixes this segfault. If I understand correctly, two-level locking is an optimization, not a mandate, so we should just disable it when btl/tcp is used.
However, even with on_demand locking, I encountered numerous issues (hangs, segfaults, and data corruption) when running other tests under ompi/onesided.
I will make a list and keep looking into it.
Honestly let's just change the default to on demand locking in this case. The performance is not bad with on demand.
I just updated https://github.com/open-mpi/ompi/pull/9400 to address several new issues I found.
The biggest issue continues to be that btl/tcp does not support self communication. To address it, I added https://github.com/open-mpi/ompi/pull/9400/commits/b0197d91d828401ba0b7b8593e026f2215cb70eb, which uses btl/self for self communication (if the selected BTL does not support self communication).
To make it work, I had to do some refactoring and make changes to both btl/self and osc/rdma.
Then there is another bug in osc/rdma: the shm file name is not unique. I added https://github.com/open-mpi/ompi/pull/9400/commits/3c0ae03e48dd543ca941ac29e26953e4c89b2158 to address it.
With the current PR, most of the tests under the ompi-tests/onesided directory pass, except:
One negative test in t_winerror fails. The test tries to free a non-existent window and expects MPI_Win_free() to return MPI_ERR_WIN; however, the error is treated as fatal, so MPI aborts. I am not sure how (or whether) we can fix this. Similar issue with test_start1.
lock4 and get2 randomly hang on two nodes; I will continue investigating.
put1 and put4 show data validation errors on two nodes; I will continue investigating.
Catching up. Can btl/sm not be used? It allows self communication. When using tcp, it should always also use btl/sm.
But I could be remembering incorrectly :).
Can btl/sm not be used?
Problem is that people are running tests with the sm BTL excluded (by specifying --mca btl self,tcp).
I do not have the full background. I wonder if there is an agreement that btl/tcp must be used with btl/sm (for osc/rdma)?
@bosilca You're assigned to this, but are you actively working on it?
https://github.com/open-mpi/ompi/pull/9696 will fix this issue too.
@jsquyres can we close this?
@jsquyres ping. can this be closed?
I think this is done. @jsquyres, if this is still an issue, please re-open.
Cisco MTT shows that MPI_WIN_CREATE is failing since the OSC pt2pt component was removed (i.e., MPI_WIN_CREATE fails because there is no OSC component available). Here's an example of a simple test failing because MPI_WIN_CREATE failed.
After discussion on the 16 June 2020 weekly Open MPI webex, @hjelmn said that he could add a few functions to make OSC RDMA work in all cases (not just in cases where a single BTL can reach all members in the communicator group).
Marking this as a blocker for v5.0 because it's preventing MPI_WIN_CREATE from working in at least some environments, and because -- per the discussion between @hjelmn and @bwbarrett -- the fix to OSC RDMA doesn't seem too difficult.