oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

Allreduce cpu example fails with CCL_WORKER_COUNT > 1 #109

Open piotrchmiel opened 6 months ago

piotrchmiel commented 6 months ago

I started experimenting with the allreduce example from the main repository: https://github.com/oneapi-src/oneCCL/blob/master/examples/cpu/cpu_allreduce_test.cpp

I modified it slightly, increasing the buffer size by a factor of 100:

diff --git a/examples/cpu/cpu_allreduce_test.cpp b/examples/cpu/cpu_allreduce_test.cpp
index 6e9ac4d..5dfe2d9 100644
--- a/examples/cpu/cpu_allreduce_test.cpp
+++ b/examples/cpu/cpu_allreduce_test.cpp
@@ -22,7 +22,7 @@
 using namespace std;

 int main() {
-    const size_t count = 4096;
+    const size_t count = 4096*100;

     size_t i = 0;

When I run it with the CCL_WORKER_COUNT environment variable set to a value greater than 1, it fails with the following errors:

piotrc@machine:~/ws/oneCCL/build$ CCL_WORKER_COUNT=2 mpirun -np 2 examples/cpu/cpu_allreduce_test
[1705415958.879795729] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
[1705415958.879801821] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 559315 RUNNING AT gbnwp-pod023-1
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 559316 RUNNING AT gbnwp-pod023-1
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

With CCL_WORKER_COUNT=1 it works perfectly.

piotrc@machine:~/ws/oneCCL/build$ mpirun -np 2 examples/cpu/cpu_allreduce_test
PASSED

What am I doing wrong? Why does it fail? Should I use specific flags when compiling, set a specific environment variable, or pass a specific option to mpirun? It is worth mentioning that with a smaller buffer size (for example 4096 * 10) everything works fine, even with CCL_WORKER_COUNT set to a value > 1.

Attached CCL_LOG_LEVEL=info logs.txt
Attached CCL_LOG_LEVEL=debug logs_debug.txt

piotrchmiel commented 6 months ago

Possible workaround:

FI_PROVIDER=verbs CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
PASSED

FI_PROVIDER=tcp CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
PASSED
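
To compare providers more systematically, the runs above can be wrapped in a small sweep. This is only a sketch: the MPIRUN and EXAMPLE variables are hypothetical names introduced here, their defaults are guesses at a typical build tree, and psm3 is included because that is the provider named in the assertion output earlier in the thread.

```shell
#!/bin/sh
# Hypothetical defaults; point these at your own build or install tree.
MPIRUN="${MPIRUN:-mpirun}"
EXAMPLE="${EXAMPLE:-examples/cpu/cpu_allreduce_test}"

# Run the example once with the given libfabric provider and worker count.
# Prints one summary line; returns non-zero if mpirun reported a failure.
run_case() {
    provider="$1"
    workers="$2"
    if FI_PROVIDER="$provider" CCL_WORKER_COUNT="$workers" \
        "$MPIRUN" -np 2 "$EXAMPLE" >/dev/null 2>&1; then
        echo "provider=$provider workers=$workers: ok"
    else
        echo "provider=$provider workers=$workers: FAILED"
        return 1
    fi
}

# Try every provider mentioned in this thread with 1 and 2 workers.
sweep() {
    for provider in psm3 verbs tcp; do
        for workers in 1 2; do
            run_case "$provider" "$workers" || true
        done
    done
}
```

Calling `sweep` from the build directory prints one line per combination, which makes it easier to see whether the failure is tied to a single provider.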

nikitaxgusev commented 6 months ago

@piotrchmiel Hi. Your fi_info output should show that psm3 is available for you; do you see that? Please run it and check: https://github.com/oneapi-src/oneCCL/tree/master/deps/ofi/bin Could you also describe how you compiled oneCCL?
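
One way to run that check is sketched below. It assumes that `fi_info -p <name>` exits with a non-zero status when the provider cannot be found; FI_INFO is a hypothetical variable introduced here so the copy under deps/ofi/bin can be used instead of a system-wide one.

```shell
#!/bin/sh
# Hypothetical override; e.g. FI_INFO=/path/to/oneCCL/deps/ofi/bin/fi_info
FI_INFO="${FI_INFO:-fi_info}"

# Report whether fi_info can find the named provider.
# A missing fi_info binary is also reported as "not reported by fi_info".
check_provider() {
    if "$FI_INFO" -p "$1" >/dev/null 2>&1; then
        echo "$1: available"
    else
        echo "$1: not reported by fi_info"
    fi
}
```

For example, `for p in psm3 verbs tcp; do check_provider "$p"; done` covers the providers discussed in this issue.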

yao-matrix commented 3 months ago

@piotrchmiel, you can try this:

echo 0 > /proc/sys/kernel/yama/ptrace_scope
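
Before changing the setting, it may help to see what it currently is. The sketch below only reads the value and translates it using the meanings given in the kernel's Yama documentation. The link to this issue is an assumption: the "Disabling CMA support" message in the log suggests that cross-memory attach was blocked, and CMA reads are subject to ptrace access checks.

```shell
#!/bin/sh
# Translate a Yama ptrace_scope value into its documented meaning.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic ptrace permissions" ;;
        1) echo "restricted ptrace (descendants only)" ;;
        2) echo "admin-only attach" ;;
        3) echo "no attach" ;;
        *) echo "unknown value: $1" ;;
    esac
}

SCOPE_FILE=/proc/sys/kernel/yama/ptrace_scope

# Print the current setting, if this kernel exposes it.
show_ptrace_scope() {
    if [ -r "$SCOPE_FILE" ]; then
        value="$(cat "$SCOPE_FILE")"
        echo "ptrace_scope=$value: $(describe_ptrace_scope "$value")"
    else
        echo "Yama ptrace_scope not available on this system"
    fi
}

show_ptrace_scope
```

Writing to the file requires root. Note that `sudo echo 0 > ...` does not work, because the redirection is performed by the unprivileged shell; `sudo sysctl -w kernel.yama.ptrace_scope=0` is one alternative, and the change lasts until the next reboot.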