NCCL collective operation timeout

When I use a large dataset (~10B tokens), I encounter an NCCL Timeout error.

Here is part of the log:

[default0]:Grouping texts in chunks of 2049:   8%|▊         | 726000/9672101 [02:32<30:38, 4866.01 examples/s]
[default2]:[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600888 milliseconds before timing out.
[default3]:[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600892 milliseconds before timing out.
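
A rough reading of the numbers in this log, as a sketch rather than a confirmed diagnosis: at the logged rate, the "Grouping texts" step alone needs about half an hour, while the collective timeout is 600000 ms (10 minutes). The snippet below only redoes that arithmetic. It assumes that ranks 2 and 3 sit in the one-element ALLREDUCE while the dataset is still being preprocessed, which the log suggests but does not prove. The 2x safety margin is an arbitrary, hypothetical choice.

```python
import math
from datetime import timedelta

# Values copied from the log above.
total_examples = 9_672_101        # size of the "Grouping texts" progress bar
examples_per_second = 4866.01     # throughput reported by the progress bar
nccl_timeout_ms = 600_000         # Timeout(ms) reported by the NCCL watchdog

# Estimated wall-clock time for the grouping step at the logged rate.
preprocessing_seconds = total_examples / examples_per_second
timeout_seconds = nccl_timeout_ms / 1000

print(f"estimated preprocessing: {preprocessing_seconds / 60:.1f} min")
print(f"collective timeout:      {timeout_seconds / 60:.1f} min")

# Assumption: the other ranks wait in a collective for this whole duration.
exceeds_timeout = preprocessing_seconds > timeout_seconds
print(f"preprocessing outlasts the timeout: {exceeds_timeout}")

# Hypothetical 2x margin, rounded up to whole minutes.
suggested_timeout = timedelta(minutes=math.ceil(2 * preprocessing_seconds / 60))
print(f"a timeout with 2x margin would be: {suggested_timeout}")
```

If that assumption holds, a value like `suggested_timeout` could be passed as the `timeout` argument of `torch.distributed.init_process_group`. Whether nanotron exposes that setting is not something this log shows. Preprocessing the dataset once ahead of time, so training starts from a cached result, would be another way to avoid the wait.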

huggingface / nanotron
