XGBoost's livelock still occurs (I had hoped it was fixed in a newer version). It looks like an XGBoost issue: https://github.com/dmlc/xgboost/issues/4107 Merged a temporary fix. A deadlock can still occur on CUDA 8 (a known NCCL issue).
Added a workaround (the NCCL build is kept in the Docker cache) so that NCCL is not compiled every time. Another option would be to store it in S3, but that would introduce a dependency on H2O infrastructure.
@sh1ng The latest NCCL doesn't solve the problem; I just tried compiling NCCL from source. I will work around the problem by using the old Group* APIs of NCCL in XGBoost.