Closed devreal closed 2 years ago
@devreal which HCOLL version? Is this running on multiple nodes?
HCOLL version is 4.6.3125-fc613b79
and I'm running on a single node.
@devreal regarding the for-the-giggles experiment: if you set -mca coll hcoll,libnbc,basic, does it still fail? I get a failure too with -mca hcoll, but setting -mca coll hcoll,libnbc,basic works, partially because hcoll doesn't support all collectives. It shouldn't segfault, though.
As far as the reproducer goes: yep, I can reproduce it, and we'll look into it. Thanks for reporting.
@janjust Thanks for looking into this. I can confirm that setting -mca coll hcoll,libnbc,basic avoids the assertion.
@janjust looks like we incorrectly map OPAL_DATATYPE_LONG to DTE_UINT32 in mca/coll/hcoll/coll_hcoll_dtypes.h (it should be DTE_INT32, signed). The -1 in the test must get cast to UINT32_MAX, hence the wrong result.
@vspetrov why 32, shouldn't a long be 64? Ah, I see now: that is for cases when sizeof(long) is 4.
I'll push a PR shortly
I think we have it incorrect for both branches (sizeof_long == 4 and == 8)
Closing - PRs merged.
I'm tracking down failures in the ARMCI test suite occurring with osc/rdma that are similar to https://github.com/open-mpi/ompi/issues/10328. I found that on my system hcoll is broken and does not properly support MPI_MIN, which breaks the detection of same_size and same_disp in osc/rdma. Here is a test case:
The output:
The expected output is 0 -1 on both processes. I get the correct result if I disable hcoll:
If I replace MPI_MIN with MPI_SUM the output is correct:
I should mention that I am seeing the following warning, which points out potential performance issues but does not hint at correctness issues:
If I set -x HCOLL_RCACHE=^ucs the warning disappears but the result stays incorrect.
I built a current Open MPI main branch against UCX 1.11.2 provided by the system. The UCX configuration is:
and
For the giggles: There seems to be another bug that occurs if I enable hcoll as the only collective implementation (on a debug build):