openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

FLAG_RNDV_FRAG assertion failure with cuda transfers within the node #5646

Closed: Akshay-Venkatesh closed this issue 4 years ago

Akshay-Venkatesh commented 4 years ago

Describe the bug

When rc is removed from UCX_TLS to move data between GPUs that are not cuda-ipc accessible, the FLAG_RNDV_FRAG assertion failure named in the title shows up. This happened while checking whether https://github.com/openucx/ucx/pull/5473 addresses https://github.com/openucx/ucx/issues/3249.

Steps to Reproduce

cc @bureddy

bureddy commented 4 years ago

@Akshay-Venkatesh will check.

Akshay-Venkatesh commented 4 years ago

@bureddy thanks for the fix.

#5675 does fix the above assertion failure. I've verified this with OSU Micro-Benchmarks 5.6.3.