openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

AMO tests fail in GTEST on Ubuntu 22.04 #8717

Open shamisp opened 1 year ago

shamisp commented 1 year ago

Describe the bug

The UCP atomic (AMO) gtest unit tests fail on Ubuntu 22.04.

Steps to Reproduce

make -C test/gtest test GTEST_FILTER="*shm_ib/test_ucp_atomic32.post/*"

Setup and versions

List of failed tests:

[  PASSED  ] 25966 tests.
[  FAILED  ] 16 tests, listed below:
[  FAILED  ] shm_ib/test_ucp_atomic32.post/4, where GetParam() = shm,ib,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib/test_ucp_atomic32.post/5, where GetParam() = shm,ib,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib/test_ucp_atomic32.fetch/4, where GetParam() = shm,ib,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib/test_ucp_atomic32.fetch/5, where GetParam() = shm,ib,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic32.post/4, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic32.post/5, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic32.fetch/4, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic32.fetch/5, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib/test_ucp_atomic64.post/4, where GetParam() = shm,ib,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib/test_ucp_atomic64.post/5, where GetParam() = shm,ib,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib/test_ucp_atomic64.fetch/4, where GetParam() = shm,ib,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib/test_ucp_atomic64.fetch/5, where GetParam() = shm,ib,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic64.post/4, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic64.post/5, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/guess/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic64.fetch/4, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/device/proto
[  FAILED  ] shm_ib_ipc/test_ucp_atomic64.fetch/5, where GetParam() = shm,ib,cuda_ipc,rocm_ipc,cuda_copy,rocm_copy/guess/proto
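The list above can be summarized mechanically. Below is a minimal sketch (not part of UCX) that parses gtest `[  FAILED  ]` summary lines, groups them by the variant suffix of `GetParam()`, and builds a `GTEST_FILTER` value covering every failing test. The helper names `parse_failed`, `variant_of`, and `build_filter` are hypothetical; the two sample lines are copied from the list above.

```python
import re
from collections import Counter

# Matches gtest summary lines such as:
# [  FAILED  ] shm_ib/test_ucp_atomic32.post/4, where GetParam() = shm,ib,.../device/proto
FAILED_RE = re.compile(
    r"^\[\s*FAILED\s*\]\s+(?P<name>\S+), where GetParam\(\) = (?P<param>\S+)"
)


def parse_failed(lines):
    """Return (test_name, param) pairs for each FAILED line that carries a param."""
    failed = []
    for line in lines:
        m = FAILED_RE.match(line.strip())
        if m:
            failed.append((m.group("name"), m.group("param")))
    return failed


def variant_of(param):
    """Drop the transport list and keep the variant suffix, e.g. 'device/proto'."""
    return param.split("/", 1)[1] if "/" in param else ""


def build_filter(failed):
    """Join test names with ':', which gtest treats as a pattern separator."""
    return ":".join(name for name, _ in failed)


sample = [
    "[  FAILED  ] 16 tests, listed below:",
    "[  FAILED  ] shm_ib/test_ucp_atomic32.post/4, where GetParam() = shm,ib,cuda_copy,rocm_copy/device/proto",
    "[  FAILED  ] shm_ib/test_ucp_atomic32.post/5, where GetParam() = shm,ib,cuda_copy,rocm_copy/guess/proto",
]

failed = parse_failed(sample)
print(Counter(variant_of(param) for _, param in failed))
print(build_filter(failed))
```

In the full list, every failure is variant `/4` or `/5`, i.e. `device/proto` or `guess/proto`. The string from `build_filter` can be passed as `GTEST_FILTER` to the `make` command under "Steps to Reproduce" to rerun only the failing cases.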

All the failures look very similar; here is an example of one:

data validation failed
[1669051418.378374] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:103: Failure
Failed
data validation failed
[1669051418.378395] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:103: Failure
Failed
data validation failed
[     INFO ] host->host ADD AND OR XOR
[1669051418.378423] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:103: Failure
Failed
data validation failed
[1669051418.378448] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:103: Failure
Failed
data validation failed
[1669051418.378469] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:103: Failure
Failed
data validation failed
[1669051418.378488] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:103: Failure
Failed
data validation failed
[     INFO ] host->host ADD AND OR XOR
/home/XXXX/YYYY/YYYY/ucx/contrib/../test/gtest/common/test.cc:366: Failure
Failed
Got 107 errors and 0 warnings during the test
[     INFO ] < /home/XXXX/YYYY/YYYY/ucx/contrib/../src/ucp/rma/amo_sw.c:227 Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010 >
[     INFO ] < /home/XXXX/YYYY/YYYY/ucx/contrib/../src/ucp/rma/amo_sw.c:227 Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010 >
[     INFO ] < /home/XXXX/YYYY/YYYY/ucx/contrib/../src/ucp/rma/amo_sw.c:227 Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010 >
[     INFO ] < /home/XXXX/YYYY/YYYY/ucx/contrib/../src/ucp/rma/amo_sw.c:227 Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010 >
[     INFO ] < /home/XXXX/YYYY/YYYY/ucx/contrib/../src/ucp/rma/amo_sw.c:227 Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010 >
[  FAILED  ] shm_ib/test_ucp_atomic32.post/5, where GetParam() = shm,ib,cuda_copy,rocm_copy/guess/proto (204 ms)
[----------] 6 tests from shm_ib/test_ucp_atomic32 (1088 ms total)
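
To check whether the errors in a log like the one above really all come from one place, the UCX `ERROR` lines can be grouped by source location and message. This is a minimal sketch that assumes the line layout seen in this log (timestamp, `[host:pid :thread]`, `file:line`, `UCX  ERROR`, message); the name `summarize_errors` is hypothetical, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Assumed layout, taken from the log in this issue:
# [<timestamp>] [<host>:<pid> :<n>]   <file>:<line>  UCX  ERROR <message>
UCX_ERROR_RE = re.compile(
    r"^\[\d+\.\d+\]\s+\[[^\]]*\]\s+(?P<loc>\S+:\d+)\s+UCX\s+ERROR\s+(?P<msg>.*)$"
)


def summarize_errors(lines):
    """Count UCX ERROR lines per (source location, message) pair."""
    counts = Counter()
    for line in lines:
        m = UCX_ERROR_RE.match(line.strip())
        if m:
            counts[(m.group("loc"), m.group("msg"))] += 1
    return counts


sample_log = [
    "data validation failed",
    "[1669051418.378374] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010",
    "[     INFO ] host->host ADD AND OR XOR",
    "[1669051418.378395] [WS-RTX:6201 :0]          amo_sw.c:227  UCX  ERROR Unsupported: got software atomic request while device atomics are selected on worker 0x55a294f8f010",
]

for (loc, msg), n in summarize_errors(sample_log).items():
    print(f"{n} x {loc}: {msg}")
```

In the excerpt above, every UCX error line points at `amo_sw.c:227` with the same message, so a single entry is the expected result for this log.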

shamisp commented 1 year ago

I did a bit more debugging on the issue. One unusual thing about this config is that the IB port is running in SDR mode (due to a specific cable config). When I switched the port to RoCE / HDR, the error disappeared. I'm pretty sure the issue is triggered by the SDR port.
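
For anyone trying to check whether a machine is in the same state, here is a minimal sketch that scans `ibv_devinfo -v` output for ports reporting SDR speed. It rests on assumptions: that the output contains `hca_id:`, `port:`, `active_speed:` and `link_layer:` lines, and that SDR shows up as speed code `(1)` (for example `2.5 Gbps (1)`). The `SAMPLE` text is illustrative only and is not output from the reporter's machine; `parse_ports`, `is_sdr` and `read_devinfo` are hypothetical helper names. Whether SDR is actually the trigger is not confirmed in this thread.

```python
import re
import subprocess

# Fields of interest in `ibv_devinfo -v` output (assumed "key: value" layout).
FIELD_RE = re.compile(
    r"^\s*(hca_id|port|link_layer|active_speed|active_width):\s*(.+?)\s*$"
)


def parse_ports(text):
    """Return one dict per port with the fields matched by FIELD_RE."""
    ports = []
    hca = None
    current = None
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if not m:
            continue
        key, value = m.groups()
        if key == "hca_id":
            hca = value          # a new device section starts
            current = None
        elif key == "port":
            current = {"hca_id": hca, "port": value}
            ports.append(current)
        elif current is not None:
            current[key] = value
    return ports


def is_sdr(port):
    """Assumption: speed code (1) in active_speed denotes SDR."""
    return port.get("active_speed", "").endswith("(1)")


def read_devinfo():
    """Run ibv_devinfo -v; requires the tool to be installed."""
    result = subprocess.run(
        ["ibv_devinfo", "-v"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative sample only, not taken from the issue.
SAMPLE = (
    "hca_id:\tmlx5_0\n"
    "\t\tport:\t1\n"
    "\t\t\tstate:\t\t\tPORT_ACTIVE (4)\n"
    "\t\t\tactive_width:\t\t4X (2)\n"
    "\t\t\tactive_speed:\t\t2.5 Gbps (1)\n"
    "\t\t\tlink_layer:\t\tInfiniBand\n"
)

for p in parse_ports(SAMPLE):
    print(p["hca_id"], p["port"], p.get("active_speed"), "SDR" if is_sdr(p) else "")
```

On a real system, `parse_ports(read_devinfo())` would replace the sample text.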

yosefe commented 1 year ago

@shamisp can you please post the "ibv_devinfo -vv" output from when the issue happens?