Closed: heya5 closed this 4 months ago
When I use a large dataset (~10B tokens), I encounter an NCCL Timeout error.
Here is part of the log:
[default0]:Grouping texts in chunks of 2049: 8%|▊ | 726000/9672101 [02:32<30:38, 4866.01 examples/s]
[default2]:[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600888 milliseconds before timing out.
[default3]:[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600892 milliseconds before timing out.
Hello, we recommend using nanoset to tokenize the dataset before running the training script. That way, no rank hits a timeout while waiting for tokenization to finish.
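To illustrate the idea of doing this work ahead of time, here is a minimal sketch of the "grouping texts in chunks of 2049" step from the log, run offline rather than inside the training job. This is not nanoset's actual implementation: the function name `group_into_chunks` and the toy token ids are hypothetical, and it assumes the documents are already tokenized into lists of integer ids.

```python
from typing import Iterable, Iterator, List


def group_into_chunks(
    tokenized_docs: Iterable[List[int]], chunk_len: int = 2049
) -> Iterator[List[int]]:
    """Concatenate token ids across documents and yield fixed-length chunks.

    Hypothetical helper for illustration only. Tokens left over at the end
    that do not fill a whole chunk are dropped.
    """
    buffer: List[int] = []
    for ids in tokenized_docs:
        buffer.extend(ids)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]


# Toy example with a small chunk length so the result is easy to read.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(group_into_chunks(docs, chunk_len=4))
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

If the chunks are written to disk once in a separate preprocessing run, the training script only has to load them, so the other ranks are not left waiting in a collective operation while one rank is still grouping texts.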