Open piotrchmiel opened 6 months ago
Possible workaround: set FI_PROVIDER to verbs or tcp. Both of the runs below PASSED:
FI_PROVIDER=verbs CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
FI_PROVIDER=tcp CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
@piotrchmiel Hi. Your fi_info output should show that psm3 is available for you; do you see that? Please run it and check: https://github.com/oneapi-src/oneCCL/tree/master/deps/ofi/bin Also, could you describe how you compiled oneCCL?
@piotrchmiel, you can try this:
echo 0 > /proc/sys/kernel/yama/ptrace_scope
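Before changing the setting, it may help to see what it is currently set to. The following is only a sketch: `describe_ptrace_scope` is a hypothetical helper name, not part of oneCCL or libfabric, and the labels are shorthand for the four modes described in the Linux kernel's Yama documentation.

```shell
#!/bin/sh
# Hypothetical helper: map a Yama ptrace_scope value to a short label.
describe_ptrace_scope() {
  case "$1" in
    0) echo "classic" ;;      # classic ptrace permissions
    1) echo "restricted" ;;   # only descendants (or declared tracers) may be attached to
    2) echo "admin-only" ;;   # attaching requires CAP_SYS_PTRACE
    3) echo "no-attach" ;;    # attaching is disabled entirely
    *) echo "unknown" ;;
  esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
  value=$(cat "$scope_file")
  echo "ptrace_scope=$value ($(describe_ptrace_scope "$value"))"
else
  # The file is absent when the Yama security module is not enabled.
  echo "Yama ptrace_scope is not available on this system"
fi
```

Writing 0 to the file requires root and does not persist across reboots.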
I started playing with the allreduce example from the main repository: https://github.com/oneapi-src/oneCCL/blob/master/examples/cpu/cpu_allreduce_test.cpp
I modified it slightly by increasing the buffer size 100 times:
When I run it with the CCL_WORKER_COUNT environment variable set to a value > 1, it fails with the following errors:
With CCL_WORKER_COUNT=1 it works perfectly.
What am I doing wrong? Why does it fail? Should I use specific flags when compiling, set a specific environment variable, or pass a specific option to mpirun? It is worth mentioning that with a smaller buffer size (for example 4096 * 10) everything works fine, even with CCL_WORKER_COUNT set to a value > 1.
Attached logs: logs.txt (CCL_LOG_LEVEL=info) and logs_debug.txt (CCL_LOG_LEVEL=debug).