huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Timeout at validation step #3212

Closed: qmin2 closed this issue 3 weeks ago

qmin2 commented 3 weeks ago

System Info

Package                  Version      Editable project location
------------------------ ------------ ---------------------------------------------------------------------
accelerate               1.0.1
aiohappyeyeballs         2.4.3
aiohttp                  3.10.10
aiosignal                1.3.1
annotated-types          0.7.0
antlr4-python3-runtime   4.9.3
asttokens                2.4.1
async-timeout            4.0.3
attrs                    24.2.0
backcall                 0.2.0
blobfile                 3.0.0
certifi                  2024.8.30
charset-normalizer       3.4.0
click                    8.1.7
contourpy                1.3.0
cycler                   0.12.1
datasets                 3.0.2
debugpy                  1.6.7
decorator                5.1.1
deepspeed                0.15.3
dill                     0.3.8
docker-pycreds           0.4.0
executing                2.1.0
filelock                 3.16.1
fonttools                4.54.1
frozenlist               1.4.1
fsspec                   2024.9.0
gitdb                    4.0.11
GitPython                3.1.43
hjson                    3.1.0
huggingface-hub          0.26.1
idna                     3.10
importlib_metadata       8.5.0
importlib_resources      6.4.5
ipykernel                6.14.0
ipython                  8.4.0
jedi                     0.19.1
Jinja2                   3.1.3
jupyter_client           8.6.3
jupyter_core             5.7.2
kiwisolver               1.4.7
lxml                     5.3.0
MarkupSafe               2.1.5
matplotlib               3.9.2
matplotlib-inline        0.1.7
mpmath                   1.3.0
msgpack                  1.1.0
multidict                6.1.0
multiprocess             0.70.16
nest_asyncio             1.6.0
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    2.0.2
nvidia-cublas-cu11       11.11.3.6
nvidia-cuda-cupti-cu11   11.8.87
nvidia-cuda-nvrtc-cu11   11.8.89
nvidia-cuda-runtime-cu11 11.8.89
nvidia-cudnn-cu11        9.1.0.70
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.3.0.86
nvidia-cusolver-cu11     11.4.1.48
nvidia-cusparse-cu11     11.7.5.86
nvidia-nccl-cu11         2.20.5
nvidia-nvtx-cu11         11.8.86
omegaconf                2.3.0
packaging                24.1
pandas                   2.2.3
parso                    0.8.4
peft                     0.13.2
pexpect                  4.9.0
pickleshare              0.7.5
pillow                   10.2.0
pip                      24.2
platformdirs             4.3.6
prompt_toolkit           3.0.48
propcache                0.2.0
protobuf                 5.28.3
psutil                   5.9.8
ptyprocess               0.7.0
pure_eval                0.2.3
py-cpuinfo               9.0.0
pyarrow                  17.0.0
pycryptodomex            3.21.0
pydantic                 2.9.2
pydantic_core            2.23.4
Pygments                 2.18.0
pyparsing                3.2.0
python-dateutil          2.9.0
pytz                     2024.2
PyYAML                   6.0.2
pyzmq                    24.0.1
regex                    2024.9.11
requests                 2.32.3
safetensors              0.4.5
seaborn                  0.13.2
sentencepiece            0.2.0
sentry-sdk               2.17.0
setproctitle             1.3.3
setuptools               75.1.0
six                      1.16.0
smmap                    5.0.1
stack-data               0.6.2
sympy                    1.13.1
tiktoken                 0.8.0
tokenizers               0.19.1
torch                    2.4.1+cu118
torchao                  0.6.1
torchaudio               2.4.1+cu118
torchtune                0.3.1
torchvision              0.19.1+cu118
tornado                  6.4.1
tqdm                     4.66.5
traitlets                5.14.3
transformers             4.45.0
triton                   3.0.0
typing_extensions        4.12.2
tzdata                   2024.2
urllib3                  2.2.3
wandb                    0.18.5
wcwidth                  0.2.13
wheel                    0.44.0
xxhash                   3.5.0
yarl                     1.16.0
zipp                     3.20.2

Information

Tasks

Reproduction

This is my training loop:

for epoch in range(1, args.num_epochs + 1):
        start_time = perf_counter()

        model.train()
        train_loss = 0

        for idx, batch in enumerate(tqdm(train_dataloader, disable=args.disable_tqdm)):
            inputs = tokenizer(batch['text'], padding="longest", truncation=True, max_length=2200, return_tensors='pt', return_token_type_ids=False).to(device)

            if (inputs['attention_mask'] == 0).any():
                print("Skipping batch due to presence of padding.")
                continue 

            inputs['labels'] = inputs['input_ids'].clone()

            label_mask = inputs['attention_mask'].bool()
            inputs['labels'][~label_mask] = -100

            loss = model(**inputs).loss

            accelerator.backward(loss)
            # if accelerator.sync_gradients:
            #     accelerator.clip_grad_value_(model.parameters(), args.max_norm)

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

            gathered_loss = accelerator.gather(loss)
            train_mean_loss = gathered_loss.mean().item()
            if accelerator.is_main_process:
                wandb.log({"train_loss": train_mean_loss})
                # TODO: log GPU utilization
                # log_gpu_metrics()

            if (idx+1)%100==0: # plot beta heatmaps
                layer_params = get_layer_parameters(model)
                plot_beta_heatmaps(layer_params, idx+1)
                if accelerator.is_main_process:
                    layer_params = get_layer_parameters(model)
                    fig_file_name = plot_beta_heatmaps(layer_params, idx+1)
                    wandb.log({"heatmap": wandb.Image(fig_file_name)})

            if (idx+1)%10==0: # plot beta (gating) changes
                if accelerator.is_main_process:
                    layer_params = get_layer_parameters(model)
                    beta_values = np.array([param['beta'] for name, param in layer_params.items()])
                    gating_values = 1 / (1 + np.exp(-beta_values)) 
                    wandb.log({f'gating_weight_layer_{j}_head_{i}':(gating_values[j][i]) for i in range(4) for j in range(6)})

            if (idx+1)%500==0: # validate every 500 steps
                ###################### valid ######################
                model.eval()
                valid_loss = 0
                for batch in tqdm(eval_dataloader, disable=args.disable_tqdm):
                    inputs = tokenizer(batch['text'], padding=True, truncation=True, max_length=2200, return_tensors='pt', return_token_type_ids=False).to(device)
                    inputs['labels'] = inputs['input_ids'].clone()

                    label_mask = inputs['attention_mask'].bool()
                    inputs['labels'][~label_mask] = -100

                    with torch.no_grad():
                        loss = model(**inputs).loss
                    gathered_loss = accelerator.gather(loss)
                    mean_loss = gathered_loss.mean().item()
                    valid_loss += mean_loss

                valid_loss /= len(eval_dataloader)
                if accelerator.is_main_process:
                    wandb.log({"valid_loss": valid_loss})
                end_time = perf_counter()
                elapsed_time = get_elapsed_time(start_time, end_time)

                print(f'[Step {idx:2}/{len(train_dataloader)}] Train Loss: {train_mean_loss:6.4f} | Valid Loss: {valid_loss:6.4f} | {elapsed_time}')
                ###################### valid ######################

Expected behavior

I am fine-tuning the Llama 3 8B model using DeepSpeed ZeRO Stage 2 with the Accelerate library on two A100 80GB GPUs on a Slurm cluster. Validation is set to run every 500 steps, but at the 500-step mark I hit the timeout error below, which terminates training. Notably, the validation dataset is very small, with only 26 examples.

However, when I reduce the validation interval to every 2 steps, the timeout error does not occur. Occasionally, the timeout error also appears at the 200-step mark.

What could be causing this issue, and are there any recommendations or workarounds to resolve it? Thank you for your assistance!

7%|▋         | 499/7151 [3:38:58<49:46:17, 26.94s/it]
7%|▋         | 496/7151 [3:38:58<46:22:13, 25.08s/it]
7%|▋         | 497/7151 [3:39:23<46:27:09, 25.13s/it]
 0%|          | 0/13 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/internal/generation_utils#transformers.Cache)
[rank0]:[E1102 03:04:14.692188080 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=129457, OpType=ALLREDUCE, NumelIn=525340672, NumelOut=525340672, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
[rank0]:[E1102 03:04:14.692510378 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 129457, last enqueued NCCL work: 129471, last completed NCCL work: 129456.
[rank0]:[E1102 03:04:24.534942920 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 0] Timeout at NCCL work: 129457, last enqueued NCCL work: 129471, last completed NCCL work: 129456.
[rank0]:[E1102 03:04:24.534972800 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1102 03:04:24.534978580 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1102 03:04:24.540499422 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=129457, OpType=ALLREDUCE, NumelIn=525340672, NumelOut=525340672, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa9de4aff86 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa9df77ddb2 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fa9df7847f3 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa9df786bdc in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x7faa30ca4b65 in /home/qmin2/anaconda3/envs/infini/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7faa3f6a51cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7faa3f310e73 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=129457, OpType=ALLREDUCE, NumelIn=525340672, NumelOut=525340672, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa9de4aff86 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa9df77ddb2 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fa9df7847f3 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa9df786bdc in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x7faa30ca4b65 in /home/qmin2/anaconda3/envs/infini/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81cf (0x7faa3f6a51cf in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7faa3f310e73 in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fa9de4aff86 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe2de06 (0x7fa9df411e06 in /home/qmin2/anaconda3/envs/infini/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b65 (0x7faa30ca4b65 in /home/qmin2/anaconda3/envs/infini/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x81cf (0x7faa3f6a51cf in /lib64/libpthread.so.0)
frame #4: clone + 0x43 (0x7faa3f310e73 in /lib64/libc.so.6)

W1102 03:04:34.767204 140299449177920 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1751935 closing signal SIGTERM
E1102 03:04:42.146098 140299449177920 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 1751934) of binary: /home/qmin2/anaconda3/envs/infini/bin/python
Warning: The cache directory for DeepSpeed Triton autotune, /home/qmin2/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Traceback (most recent call last):
qmin2 commented 3 weeks ago

This error occurs due to a state discrepancy between the two GPUs. Changing the code from

if (inputs['attention_mask'] == 0).any():
    print("Skipping batch due to presence of padding.")
    continue

to

padding_present = (inputs['attention_mask'] == 0).any()
padding_present = accelerator.gather(padding_present)  # gather the flag from every GPU so all ranks agree

if padding_present.any().item():
    if accelerator.is_main_process:
        print("Skipping batch due to presence of padding.")
    continue

resolves the error.
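
The hang happens because the `continue` is taken on only one rank: under DeepSpeed ZeRO Stage 2 every rank is expected to join the gradient all-reduce for that step, so the rank that skipped the batch never enters the collective and the NCCL watchdog fires after the default 10-minute timeout (the `ALLREDUCE ... Timeout(ms)=600000` entry in the log above). Gathering the skip flag, as shown, keeps all ranks on the same control path.

Separately, if a legitimately long step (for example, a slow validation pass) ever needs more than 10 minutes, the collective timeout can be raised through Accelerate's kwargs handlers. A minimal sketch; the variable name and the 2-hour value are illustrative, and this only buys time, it does not fix ranks that diverge:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout from the default 10 minutes to 2 hours.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])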