open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

UCX seemingly causing Python MPI tests to fail with Open MPI v4.0.1 #6777

Open jsquyres opened 5 years ago

jsquyres commented 5 years ago

As reported on the mpi4py bitbucket, it looks like enabling UCX support in Open MPI v4.0.1 in Fedora 30 is causing some mpi4py tests to fail.

See the link above for more details, but the short version is:

@jladd-mlnx @artpol84 Can someone from Mellanox look into this?

FYI: @dalcinl @opoplawski

artpol84 commented 5 years ago

@jsquyres

openmpi-4.0.1-1.fc31.x86_64 seems to be older than openmpi-4.0.1-5.fc31.x86_64. Can this indicate that the problem was fixed?

artpol84 commented 5 years ago

FYI: @yosefe

jsquyres commented 5 years ago

@artpol84 No, please re-read my summary and/or the entire thread.

artpol84 commented 5 years ago

He disabled UCX in the -5 version, which enabled the tests to work.

> @artpol84 No, please re-read my summary and/or the entire thread.

@jsquyres, thanks. I missed it.

jsquyres commented 5 years ago

@artpol84 Can Mellanox check to see if this is now fixed with Open MPI v4.0.2?

artpol84 commented 5 years ago

@jsquyres I don't expect it to be fixed as of now. UCX doesn't support 1B and 2B atomics. We are planning to fix it in the near future, but it is not yet fixed. @janjust, @jladd-mlnx, please correct me if I am wrong.

opoplawski commented 4 years ago

I suspect that this was resolved at some point. At least, mpi4py test_rma is no longer failing on Fedora Rawhide.

jsquyres commented 4 years ago

@artpol84 Since you're planning to support it in the near future, let's leave this open to track it.

hppritcha commented 3 years ago

@artpol84 does UCX support 1 and 2 byte atomics now?

yosefe commented 3 years ago

> @artpol84 does UCX support 1 and 2 byte atomics now?

It does not

hppritcha commented 3 years ago

closing as no longer being observed by @opoplawski

dalcinl commented 3 years ago

@hppritcha I believe the issues are not fully resolved yet. I'm running Fedora 33, with openmpi-4.0.5-1.fc33.x86_64. The following test run is with mpi4py/master.

$ mpiexec -n 1 python test/runtests.py --no-threads -v -i rma$ TestRMASelf
[0@optiplex] Python 3.9 (/usr/bin/python)
[0@optiplex] MPI 3.1 (Open MPI 4.0.5)
[0@optiplex] mpi4py 3.1.0a0 (/home/dalcinl/Devel/mpi4py-dev/build/lib.linux-x86_64-3.9/mpi4py)
testAccumulate (test_rma.TestRMASelf) ... ok
testAccumulateProcNullReplace (test_rma.TestRMASelf) ... ok
testAccumulateProcNullSum (test_rma.TestRMASelf) ... ok
testCompareAndSwap (test_rma.TestRMASelf) ... [1617082714.741125] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741161] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741172] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741179] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741203] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741211] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741216] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741362] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741378] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741388] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741415] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741428] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741437] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741593] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741606] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741614] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741646] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741656] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741678] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741849] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741882] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741895] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.741936] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741951] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.741964] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
ok
testFence (test_rma.TestRMASelf) ... ok
testFenceAll (test_rma.TestRMASelf) ... ok
testFetchAndOp (test_rma.TestRMASelf) ... [1617082714.742394] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742406] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742412] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742443] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742449] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742453] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742459] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742464] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742468] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742507] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742514] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742533] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742562] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742568] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742572] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742594] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742600] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742606] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.742966] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742982] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.742992] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743084] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743109] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743119] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743128] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743137] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743158] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743201] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743225] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743234] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743312] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743322] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743331] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743341] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743351] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743360] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.743815] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743826] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743834] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743873] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743880] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743888] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743895] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743902] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743910] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.743965] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743977] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.743985] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744054] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744062] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744069] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744076] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744082] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744089] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744165] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744174] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744182] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744239] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744247] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744255] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744278] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744286] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744293] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.744805] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744837] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744849] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744919] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744930] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744941] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744953] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744964] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.744975] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8
[1617082714.745031] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745045] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745056] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745141] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745152] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745163] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745175] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745186] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745197] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16
[1617082714.745815] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745827] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745836] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745887] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745896] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745905] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745914] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745923] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
[1617082714.745931] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128
ok
testFlush (test_rma.TestRMASelf) ... ok
testGetAccumulate (test_rma.TestRMASelf) ... ok
testGetAccumulateProcNull (test_rma.TestRMASelf) ... ok
testGetProcNull (test_rma.TestRMASelf) ... ok
testPostWait (test_rma.TestRMASelf) ... ok
testPutGet (test_rma.TestRMASelf) ... ok
testPutProcNull (test_rma.TestRMASelf) ... ok
testStartComplete (test_rma.TestRMASelf) ... ok
testStartCompletePostTest (test_rma.TestRMASelf) ... ok
testStartCompletePostWait (test_rma.TestRMASelf) ... ok
testSync (test_rma.TestRMASelf) ... ok

----------------------------------------------------------------------
Ran 18 tests in 0.089s

OK

yosefe commented 3 years ago

It seems MPI is trying to use 1-byte and 2-byte atomics, which are not supported by UCX.

hoopoepg commented 3 years ago

and 16 byte datatypes
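The log above is long but repetitive, so a tally per datatype code makes it easier to read. Here is a minimal sketch, assuming every error line has the same `UCX  ERROR invalid atomic operation datatype: N` shape as in the paste. The function name `tally_ucx_errors` and the regular expression are illustrative, not part of UCX or mpi4py.

```python
import re
from collections import Counter

# Pattern inferred from the pasted log lines; this is an assumption,
# not a documented UCX log format.
_ERR = re.compile(r"UCX\s+ERROR invalid atomic operation datatype: (\d+)")

def tally_ucx_errors(lines):
    """Count 'invalid atomic operation datatype' errors per datatype code."""
    counts = Counter()
    for line in lines:
        match = _ERR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Three lines copied from the log above, plus one unrelated test line.
sample = [
    "[1617082714.741125] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 8",
    "[1617082714.741203] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 16",
    "[1617082714.743815] [optiplex:437360:0]       amo_send.c:175  UCX  ERROR invalid atomic operation datatype: 128",
    "testFence (test_rma.TestRMASelf) ... ok",
]
print(dict(tally_ucx_errors(sample)))
```

Feeding it the captured output of the `mpiexec` run (for example `tally_ucx_errors(open("run.log"))`, where `run.log` is a hypothetical file name) would give the number of errors per code.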

dalcinl commented 3 years ago

Yes, I have a test that loops over datatypes and performs CompareAndSwap and FetchAndOp. Isn't that a reasonable test? IMHO, if an MPI implementation cannot support the operation for some datatypes, it should barf with an error.

yosefe commented 3 years ago

> and 16 byte datatypes

datatype=16 is 2-byte contig

dalcinl commented 3 years ago

> datatype=16 is 2-byte contig

And datatype=128 is 16-byte contig, right?

yosefe commented 3 years ago

> And datatype=128 is 16-byte contig, right?

Right. osc/ucx should fall back to active messages if the datatype is not 4 or 8 bytes.
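The codes discussed above (8 for 1 byte, 16 for 2 bytes, 128 for 16 bytes) are consistent with a contiguous datatype code being the element size shifted left by three bits. The sketch below works through that arithmetic together with the fallback rule just described. The shift value is inferred from the numbers in this thread, and `needs_fallback` is a hypothetical helper for illustration, not code from osc/ucx.

```python
# Assumption: a contiguous datatype code is (element size << 3), inferred
# from the thread (16 -> 2-byte contig, 128 -> 16-byte contig).
CONTIG_SHIFT = 3

def contig_code(elem_size):
    """Datatype code for a contiguous element of elem_size bytes."""
    return elem_size << CONTIG_SHIFT

def contig_size(code):
    """Element size in bytes for a contiguous datatype code."""
    return code >> CONTIG_SHIFT

def needs_fallback(elem_size):
    """Hypothetical rule from the thread: only 4- and 8-byte operands can
    use the native atomics; every other size needs another path."""
    return elem_size not in (4, 8)

# The three codes that appear in the error log.
for code in (8, 16, 128):
    size = contig_size(code)
    print(f"datatype={code}: {size}-byte contig, fallback={needs_fallback(size)}")
```

Under these assumptions, all three codes in the log are sizes that would need the fallback path, while 4-byte and 8-byte operands (codes 32 and 64) would not.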

gpaulsen commented 3 years ago

@dalcinl What version of UCX are you using with Fedora 33?

dalcinl commented 3 years ago

@gpaulsen These are the current openmpi and ucx packages in my Fedora 33:

$ rpm -qa | egrep "(ucx|openmpi)"
openmpi-4.0.5-1.fc33.x86_64
openmpi-devel-4.0.5-1.fc33.x86_64
ucx-1.9.0-1.fc33.x86_64

hjelmn commented 3 years ago

Looking at that log, osc/ucx should not be in use. It should be losing to osc/rdma when not using a Mellanox HCA. The failure case has NP=1, which I don't think should ever be using UCX.

This doesn't address the issue that osc/ucx is doing the wrong thing (it is), but it does indicate that this version of Open MPI is using the wrong components by default.
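One way to check whether osc/ucx is the component involved is to rerun the failing test with it excluded. The sketch below only builds the environment and command line and launches nothing. It assumes Open MPI's `OMPI_MCA_<framework>` environment variables and the `^component` exclusion syntax, and it reuses the command from the earlier report. `build_run` is a hypothetical helper.

```python
import os

def build_run(exclude=("ucx",), framework="osc", base_env=None):
    """Return (env, cmd) for rerunning the RMA test with components excluded.

    Assumes Open MPI reads MCA parameters from OMPI_MCA_<name> environment
    variables and that a leading '^' excludes the listed components.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[f"OMPI_MCA_{framework}"] = "^" + ",".join(exclude)
    # Same invocation as in the report above.
    cmd = ["mpiexec", "-n", "1", "python", "test/runtests.py",
           "--no-threads", "-v", "-i", "rma$", "TestRMASelf"]
    return env, cmd

env, cmd = build_run(base_env={})
print(env["OMPI_MCA_osc"])
print(" ".join(cmd))
```

Passing the result to `subprocess.run(cmd, env=env)` from the mpi4py source tree would run the test. If the UCX errors disappear, that would suggest osc/ucx was indeed being selected.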

dalcinl commented 3 years ago

@hjelmn mpi4py initializes MPI with THREAD_MULTIPLE. Perhaps that is affecting component selection?

thangckt commented 1 year ago

I get the error "Caught signal 11 (Segmentation fault: address not mapped to object at address" when running Python code using Open MPI with UCX. When I disable UCX, the code runs without any error.

Does anyone know why? Or is there any hint I can try to avoid this error?