Ensure MG makes the same number of allreduce calls in mean_stddev for sparse matrices to avoid hanging

rapidsai / cuml

cuML - RAPIDS Machine Learning Library

https://docs.rapids.ai/api/cuml/stable/

Apache License 2.0

Open lijinf2 opened 9 hours ago

lijinf2 commented 9 hours ago

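The hang occurs when one GPU receives a sparse matrix containing only zero values while the other GPUs receive non-zero values. The ranks then issue different numbers of allreduce calls in mean_stddev, and the collective never completes.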
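
To illustrate why a mismatch in call counts is a problem, here is a minimal pure-Python sketch. It does not use cuML, RAFT, or NCCL, and it does not reproduce the actual code path. The function names `planned_allreduce_calls` and `would_hang`, the early-return behaviour for an empty partition, and the figure of two allreduce calls per rank are all assumptions made for illustration only.

```python
# Hypothetical simulation: no real GPUs or communicator are involved.

def planned_allreduce_calls(local_nnz, skip_when_empty):
    """Return how many allreduce calls one rank would issue.

    Assumption for illustration: the computation needs two allreduce
    calls, and the buggy path skips them when the rank has no non-zeros.
    """
    if skip_when_empty and local_nnz == 0:
        return 0  # assumed buggy path: this rank never joins the collective
    return 2


def would_hang(nnz_per_rank, skip_when_empty):
    """A collective can complete only if every rank issues the same number of calls."""
    counts = {planned_allreduce_calls(nnz, skip_when_empty) for nnz in nnz_per_rank}
    return len(counts) > 1


nnz_per_rank = [0, 5, 3]  # rank 0 holds an all-zero sparse partition
print(would_hang(nnz_per_rank, skip_when_empty=True))   # True: call counts differ
print(would_hang(nnz_per_rank, skip_when_empty=False))  # False: every rank participates
```

In this sketch the fix is for the empty rank to still take part in each collective, contributing zeros, so that all ranks stay in step.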