huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

NCCL collective operation timeout #204

Closed heya5 closed 4 months ago

heya5 commented 4 months ago

When I use a large dataset (~10B tokens), I encounter an NCCL Timeout error.

Here is part of the log:

[default0]:Grouping texts in chunks of 2049:   8%|▊         | 726000/9672101 [02:32<30:38, 4866.01 examples/s]
[default2]:[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600888 milliseconds before timing out.
[default3]:[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600892 milliseconds before timing out.

xrsrke commented 4 months ago

Hello, we recommend using nanoset to tokenize the dataset before running the training script. That way, no rank times out while waiting for tokenization to finish.
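
To illustrate why preprocessing at startup can trip the watchdog, here is a rough sketch that compares the estimated preprocessing time with the collective timeout. The numbers are taken from the log above. The helper name `estimate_preprocessing` is hypothetical, and the sketch assumes the timed-out ranks were blocked in the all-reduce while rank 0 was still grouping texts, which the thread suggests but does not state explicitly.

```python
from datetime import timedelta


def estimate_preprocessing(num_examples, examples_per_second, timeout):
    """Return (exceeds_timeout, estimated_duration) for a preprocessing pass.

    Hypothetical helper for illustration only; it is not part of nanotron.
    """
    estimated = timedelta(seconds=num_examples / examples_per_second)
    return estimated > timeout, estimated


# Figures read from the log: 9672101 examples at about 4866.01 examples/s,
# and Timeout(ms)=600000 on the ALLREDUCE that timed out.
nccl_timeout = timedelta(milliseconds=600000)
exceeds, estimated = estimate_preprocessing(9672101, 4866.01, nccl_timeout)

print(f"estimated preprocessing time: {estimated}")
print(f"collective timeout:           {nccl_timeout}")
print(f"preprocessing exceeds timeout: {exceeds}")
```

At the reported rate, the grouping step needs roughly 33 minutes, well beyond the 10-minute timeout, so any rank waiting on a collective during that time would be killed by the watchdog. Tokenizing ahead of time removes this wait from the training run.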