Open jsquyres opened 5 years ago
@jsquyres openmpi-4.0.1-1.fc31.x86_64 seems to be older than openmpi-4.0.1-5.fc31.x86_64. Can this indicate that the problem was fixed?
FYI: @yosefe
@artpol84 No, please re-read my summary and/or the entire thread.
He disabled UCX in the -5 version, which enabled the tests to work.
@jsquyres, thanks. I missed it.
@artpol84 Can Mellanox check to see if this is now fixed with Open MPI v4.0.2?
@jsquyres I don't expect it to be fixed as of now. UCX doesn't support 1B and 2B atomics. We are planning to fix it in the near future, but it is not yet fixed. @janjust, @jladd-mlnx, please correct me if I am wrong.
I suspect that this was resolved at some point. At least, mpi4py test_rma is no longer failing on Fedora Rawhide.
@artpol84 Since you're planning to support it in the near future, let's leave this open to track it.
@artpol84 does UCX support 1 and 2 byte atomics now?
It does not
Closing, as this is no longer being observed by @opoplawski.
@hppritcha I believe the issues are not fully resolved yet. I'm running Fedora 33, with openmpi-4.0.5-1.fc33.x86_64. The following test run is with mpi4py/master.
$ mpiexec -n 1 python test/runtests.py --no-threads -v -i rma$ TestRMASelf
[0@optiplex] Python 3.9 (/usr/bin/python)
[0@optiplex] MPI 3.1 (Open MPI 4.0.5)
[0@optiplex] mpi4py 3.1.0a0 (/home/dalcinl/Devel/mpi4py-dev/build/lib.linux-x86_64-3.9/mpi4py)
testAccumulate (test_rma.TestRMASelf) ... ok
testAccumulateProcNullReplace (test_rma.TestRMASelf) ... ok
testAccumulateProcNullSum (test_rma.TestRMASelf) ... ok
testCompareAndSwap (test_rma.TestRMASelf) ... [1617082714.741125] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741161] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741172] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741179] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741203] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741211] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741216] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741362] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741378] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741388] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741415] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741428] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741437] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741593] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741606] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741614] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741646] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741656] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741678] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741849] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741882] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741895] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.741936] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741951] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.741964] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
ok
testFence (test_rma.TestRMASelf) ... ok
testFenceAll (test_rma.TestRMASelf) ... ok
testFetchAndOp (test_rma.TestRMASelf) ... [1617082714.742394] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742406] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742412] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742443] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742449] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742453] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742459] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742464] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742468] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742507] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742514] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742533] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742562] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742568] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742572] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742594] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742600] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742606] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.742966] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742982] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.742992] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743084] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743109] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743119] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743128] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743137] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743158] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743201] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743225] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743234] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743312] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743322] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743331] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743341] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743351] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743360] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.743815] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743826] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743834] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743873] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743880] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743888] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743895] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743902] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743910] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.743965] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743977] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.743985] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744054] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744062] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744069] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744076] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744082] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744089] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744165] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744174] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744182] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744239] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744247] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744255] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744278] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744286] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744293] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.744805] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744837] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744849] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744919] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744930] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744941] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744953] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744964] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.744975] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8
[1617082714.745031] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745045] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745056] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745141] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745152] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745163] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745175] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745186] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745197] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16
[1617082714.745815] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745827] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745836] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745887] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745896] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745905] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745914] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745923] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
[1617082714.745931] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128
ok
testFlush (test_rma.TestRMASelf) ... ok
testGetAccumulate (test_rma.TestRMASelf) ... ok
testGetAccumulateProcNull (test_rma.TestRMASelf) ... ok
testGetProcNull (test_rma.TestRMASelf) ... ok
testPostWait (test_rma.TestRMASelf) ... ok
testPutGet (test_rma.TestRMASelf) ... ok
testPutProcNull (test_rma.TestRMASelf) ... ok
testStartComplete (test_rma.TestRMASelf) ... ok
testStartCompletePostTest (test_rma.TestRMASelf) ... ok
testStartCompletePostWait (test_rma.TestRMASelf) ... ok
testSync (test_rma.TestRMASelf) ... ok
----------------------------------------------------------------------
Ran 18 tests in 0.089s
OK
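Since the tests report "ok" despite the errors, it can help to count how often each datatype code shows up in the output. The following is a minimal sketch, assuming the error lines keep the format shown in the log above; the helper name tally_datatype_errors and the sample list are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches the UCX error lines from the log; the datatype code is the trailing integer.
UCX_ERROR = re.compile(r"UCX\s+ERROR invalid atomic operation datatype: (\d+)")


def tally_datatype_errors(lines):
    """Count 'invalid atomic operation datatype' errors per datatype code."""
    counts = Counter()
    for line in lines:
        match = UCX_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# A few lines copied from the log above, one per datatype code.
sample = [
    "testCompareAndSwap (test_rma.TestRMASelf) ... [1617082714.741125] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 8",
    "[1617082714.741203] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 16",
    "[1617082714.743815] [optiplex:437360:0] amo_send.c:175 UCX ERROR invalid atomic operation datatype: 128",
    "ok",
]

print(dict(tally_datatype_errors(sample)))
```

Feeding the full captured output through the same function would show which codes dominate for each test.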
It seems MPI is trying to use 1-byte and 2-byte atomics, which are not supported by UCX.
And 16-byte datatypes.
Yes, I have a test that loops over datatypes and performs CompareAndSwap and FetchAndOp. Isn't that a reasonable test? IMHO, if an MPI implementation cannot support the operation for some datatypes, it should barf with an error.
datatype=16 is 2-byte contig
And datatype=128 is 16-byte contig, right?
Right. osc/ucx should fall back to active messages if the datatype is not 4 or 8 bytes.
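The mapping from the logged codes to element sizes can be sketched as follows. This assumes UCX encodes a contiguous datatype as the element size shifted left by 3 bits, with the low 3 bits holding the datatype class (0 for contiguous); that matches the 16 and 128 cases confirmed in this thread, but the constants and both function names here are illustrative assumptions, not osc/ucx code.

```python
# Assumed encoding of a UCX contiguous datatype: (element_size << 3) | class.
UCP_DATATYPE_SHIFT = 3
UCP_DATATYPE_CLASS_MASK = (1 << UCP_DATATYPE_SHIFT) - 1
UCP_DATATYPE_CONTIG = 0  # assumed class value for contiguous datatypes

# Element sizes that, per this thread, UCX atomics handle directly.
NATIVE_ATOMIC_SIZES = (4, 8)


def contig_element_size(datatype_code):
    """Return the element size in bytes encoded in a contiguous datatype code."""
    if (datatype_code & UCP_DATATYPE_CLASS_MASK) != UCP_DATATYPE_CONTIG:
        raise ValueError(f"not a contiguous datatype code: {datatype_code}")
    return datatype_code >> UCP_DATATYPE_SHIFT


def choose_atomic_path(datatype_code):
    """Pick 'native' for 4- and 8-byte elements, 'fallback' for everything else."""
    size = contig_element_size(datatype_code)
    return "native" if size in NATIVE_ATOMIC_SIZES else "fallback"


# The three codes that appear in the log above.
for code in (8, 16, 128):
    print(code, contig_element_size(code), choose_atomic_path(code))
```

Under these assumptions the codes 8, 16, and 128 from the log decode to 1, 2, and 16 bytes, all of which would take the fallback path.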
@dalcinl What version of UCX are you using with Fedora 33?
@gpaulsen These are the current openmpi and ucx packages in my Fedora 33:
$ rpm -qa | egrep "(ucx|openmpi)"
openmpi-4.0.5-1.fc33.x86_64
openmpi-devel-4.0.5-1.fc33.x86_64
ucx-1.9.0-1.fc33.x86_64
Looking at that log, osc/ucx should not be in use. It should be losing to osc/rdma when not using a Mellanox HCA. The failure case has NP=1, which I don't think should ever be using UCX.
That doesn't address the issue that osc/ucx is doing the wrong thing (it is), but it does indicate that this version of Open MPI is using the wrong components by default.
@hjelmn mpi4py initializes MPI with THREAD_MULTIPLE. Perhaps that is affecting component selection?
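One way to narrow down the component-selection question is to exclude osc/ucx explicitly and rerun the test. The sketch below relies on Open MPI's convention that MCA parameters can be set through OMPI_MCA_<name> environment variables and that a leading ^ excludes the listed components; the helper name mca_env is hypothetical, and the commented-out command simply repeats the invocation from earlier in the thread.

```python
import os


def mca_env(base_env=None, excluded_osc=("ucx",)):
    """Return a copy of the environment that excludes the given osc components."""
    env = dict(os.environ if base_env is None else base_env)
    # A leading '^' tells Open MPI to exclude the listed components.
    env["OMPI_MCA_osc"] = "^" + ",".join(excluded_osc)
    return env


env = mca_env()
print(env["OMPI_MCA_osc"])

# Usage (not executed here; requires an Open MPI installation and the mpi4py test suite):
# import subprocess
# subprocess.run(
#     ["mpiexec", "-n", "1", "python", "test/runtests.py",
#      "--no-threads", "-v", "-i", "rma$", "TestRMASelf"],
#     env=env,
# )
```

If the UCX errors disappear with osc/ucx excluded, that points at component selection rather than at the test itself.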
I have a problem: " Caught signal 11 (Segmentation fault: address not mapped to object at address" when running Python code using Open MPI with UCX. When I disable UCX, the code runs without any error.
Does anyone know why? Or is there any hint that I can try to avoid this error?
As reported on the mpi4py bitbucket, it looks like enabling UCX support in Open MPI v4.0.1 in Fedora 30 is causing some mpi4py tests to fail.
See the link above for more details, but the short version is:
MPI_SIGNED_CHAR and MPI_SHORT.
@jladd-mlnx @artpol84 Can someone from Mellanox look into this?
FYI: @dalcinl @opoplawski