mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Error:"Watchdog caught collective operation timeout" when finetuning MPT-7B on a local dataset using 2 A100 GPUs #203

Closed satyaskada closed 1 year ago

satyaskada commented 1 year ago

Hi, I am trying to finetune the MPT-7B model using a local dataset on 2 A100 80GB GPUs. Below is the complete log. Torch version: 1.13.1+cu117. I'd appreciate any help resolving the issue.

/mpt-7b/llm-foundry/scripts/train# composer train.py yamls/finetune/mpt-7b_jokes.yaml
Initializing model...
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:153: UserWarning: While attn_impl: triton can be faster than attn_impl: flash it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using attn_impl: flash if your model does not use alibi or prefix_lm.
  warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 4.42s/it]
cfg.n_params=6.65e+09
Building train loader...
Using pad_token, but it is not set yet.
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e96cabec8ddb1637.arrow
Building eval loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-76316eeb9f4e44d9.arrow
Building trainer...
Logging config...
max_seq_len: 2048
global_seed: 17
run_name: mpt-7b-finetune

model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  config_overrides:
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false

tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

train_loader:
  name: finetuning
  dataset:
    hf_name: local_dataset
    split: train
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  name: finetuning
  dataset:
    hf_name: local_dataset
    split: test
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

scheduler:
  name: linear_decay_with_warmup
  t_warmup: 50ba
  alpha_f: 0

optimizer:
  name: decoupled_adamw
  lr: 5.0e-06
  betas:


Config:
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 2
num_nodes: 1
rank_zero_seed: 17


[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 600963 milliseconds before timing out.
ERROR:composer.cli.launcher:Rank 1 crashed with exit code -6. Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 600963 milliseconds before timing out.
Global rank 0 (PID 786) exited with code -6
Global rank 1 (PID 787) exited with code -6
----------Begin global rank 1 STDOUT----------
Initializing model...
cfg.n_params=6.65e+09
Building train loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building eval loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building trainer...

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:153: UserWarning: While attn_impl: triton can be faster than attn_impl: flash it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using attn_impl: flash if your model does not use alibi or prefix_lm.
  warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:08<00:08, 8.54s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 4.94s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00, 5.48s/it]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e96cabec8ddb1637.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-76316eeb9f4e44d9.arrow
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 601403 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 601403 milliseconds before timing out.

----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 786) exited with code -6
/mpt-7b/llm-foundry/scripts/train#

jquesnelle commented 1 year ago

I'm seeing the same thing... interestingly enough, also with 2 A100 GPUs. I'm using torch 2.0.1 (the default pip3 install torch).

NarenZen commented 1 year ago

I also have this timeout error.

yqli2420 commented 1 year ago

I encountered the same problem. Could you tell me how you resolved it?

eliaz commented 1 year ago

I also had this problem, but it works after setting this: export NCCL_P2P_LEVEL=NVL. On a 4x A100 80GB setup, I verified twice that it resolved the issue when it occurred.
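
For reference, a minimal sketch of applying that workaround before launching (the YAML path is just the one from the original report; substitute your own config):

# Restrict NCCL peer-to-peer transfers to NVLink-connected GPU pairs,
# then launch training as usual from scripts/train.
export NCCL_P2P_LEVEL=NVL
composer train.py yamls/finetune/mpt-7b_jokes.yaml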

jquesnelle commented 1 year ago

I've also had to go down to a 2x A100 setup because otherwise I run into the NCCL error; nothing larger seems to work.

jquesnelle commented 1 year ago

> I also had this problem, but it works after setting this: export NCCL_P2P_LEVEL=NVL. On a 4x A100 80GB setup, I verified twice that it resolved the issue when it occurred.

Were you on torch 1.13.1 or 2.0.1?

eliaz commented 1 year ago
root@ce755e977208:/llm-foundry# pip show torch
Name: torch
Version: 1.13.1+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/lib/python3/dist-packages
Requires: typing-extensions
Required-by: composer, flash-attn, llm-foundry, mosaicml-streaming, pytorch-ranger, torch-optimizer, torchmetrics, torchtext, torchvision, triton-pre-mlir

I'm running the Docker image "mosaicml/pytorch:latest".

It shows these by default:

env|grep -i ncc
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
NCCL_VERSION=2.13.4-1
NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NV_LIBNCCL_PACKAGE_NAME=libnccl2
NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1

I tried with two GPUs and didn't run into this issue, but with four GPUs I do, and then it works when using the env variable. I get an OOM error now, but that's probably unrelated :)
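
For anyone else debugging a similar hang, a couple of generic NCCL/GPU diagnostics (standard tools, not specific to llm-foundry) can show whether peer-to-peer transport is the problem:

# Show the GPU interconnect topology (NVLink vs. PCIe paths between GPUs).
nvidia-smi topo -m

# Print the NCCL version torch was built against.
python -c "import torch; print(torch.cuda.nccl.version())"

# Re-run with verbose NCCL logging to see which transport each collective selects.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
composer train.py yamls/finetune/mpt-7b_jokes.yaml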

alextrott16 commented 1 year ago

Are people still running into this problem?

sasaadi commented 1 year ago

I still have the same issue after setting export NCCL_P2P_LEVEL=NVL

alextrott16 commented 1 year ago

If you haven't already, can you try working off one of the recommended images in the top-level README, making sure that your code is up-to-date with the main branch, and re-installing to get all the latest dependencies? Basically, I'm wondering if this happens after following the install/set-up instructions in the README.

NCCL errors are notoriously hard to diagnose, so it'd be helpful to see if this is just an environment issue. But, honestly, there's not a lot to go off of here, so I can't make any promises.
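
As a rough outline, the setup I'm suggesting looks something like this (a sketch only; the image tag is the one mentioned earlier in the thread, and the top-level README lists the currently recommended images and exact install steps):

# Start from one of the recommended Docker images (check the README for current tags).
docker run --gpus all -it --rm mosaicml/pytorch:latest

# Inside the container: get up to date with main and reinstall with the GPU extras.
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
pip install -e ".[gpu]"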

sasaadi commented 1 year ago

I updated my code with the main branch and reinstalled the whole environment. It's working now... thanks

sasaadi commented 1 year ago

Follow-up on this error: when I use a larger local dataset, I get the error again when saving the checkpoint. Do you have any ideas on that? thanks

RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29544, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 601762 milliseconds before timing out.
....
....
composer.core.engine: Post-closing callback RuntimeEstimator
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ERROR:composer.cli.launcher:Global rank 0 (PID 160339) exited with code -6

hithesh-sankararaman commented 1 year ago

> Follow-up on this error: when I use a larger local dataset, I get the error again when saving the checkpoint. Do you have any ideas on that? thanks
>
> RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29544, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 601762 milliseconds before timing out.
> ....
> ....
> composer.core.engine: Post-closing callback RuntimeEstimator
> [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
> [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
> ERROR:composer.cli.launcher:Global rank 0 (PID 160339) exited with code -6

I am also getting the same error while saving the model.

Tao-Cute commented 1 year ago

> Follow-up on this error: when I use a larger local dataset, I get the error again when saving the checkpoint. Do you have any ideas on that? thanks
>
> RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29544, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 601762 milliseconds before timing out.
> ....
> ....
> composer.core.engine: Post-closing callback RuntimeEstimator
> [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
> [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
> ERROR:composer.cli.launcher:Global rank 0 (PID 160339) exited with code -6
>
> I am also getting the same error while saving the model.

I'm sure the problem comes from saving the model on multiple GPUs, but I can't fix it. Have you fixed it yet?

dakinggg commented 1 year ago

We have fixed the issue with NCCL timeouts at the end of runs while saving checkpoints. I have a feeling there may be a variety of issues in this thread at this point, so I am going to close this one. Please open a new issue if you are still encountering problems.