rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.25k stars, 534 forks

Ensure MG makes the same number of allreduce calls in mean_stddev for sparse matrices to avoid hanging #6141

Open lijinf2 opened 9 hours ago

lijinf2 commented 9 hours ago

The hang occurs when one GPU gets a sparse matrix of all zero values, while the other GPUs get non-zero values.
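
A minimal sketch of why mismatched collective calls can hang, written as a plain Python simulation rather than cuML's actual mean_stddev code. The helper names, the early-exit condition, and the figure of two allreduce calls per rank are all assumptions made for illustration. The idea is that an allreduce completes only when every rank joins it, so a rank that skips the call because its partition is empty leaves the others waiting.

```python
# Hypothetical illustration only; this is not cuML's mean_stddev implementation.

def planned_allreduce_calls(local_nnz, skip_when_empty):
    """Return how many allreduce calls one rank would issue.

    The count of 2 (e.g. one for the sum, one for the sum of squares)
    is an assumed value, chosen only to make the mismatch visible.
    """
    if skip_when_empty and local_nnz == 0:
        return 0  # assumed early exit: this rank never joins the collective
    return 2


def would_hang(nnz_per_rank, skip_when_empty):
    """Collectives complete only if every rank issues the same number of calls."""
    counts = {planned_allreduce_calls(n, skip_when_empty) for n in nnz_per_rank}
    return len(counts) > 1


# Rank 0 holds an all-zero sparse partition; ranks 1 and 2 hold non-zero values.
nnz_per_rank = [0, 5, 3]

print(would_hang(nnz_per_rank, skip_when_empty=True))   # ranks disagree on call count
print(would_hang(nnz_per_rank, skip_when_empty=False))  # every rank always participates
```

Under these assumptions, having the empty rank still take part in each allreduce with a zero contribution keeps the call counts equal across ranks.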