Closed satyaskada closed 1 year ago
I'm seeing the same thing, and interestingly I'm also using 2 A100 GPUs. I'm using torch 2.0.1 (the default from pip3 install torch).
I also get a timeout error.
I encountered the same problem. Could you tell me how you resolved it?
I also had this problem, but it works after setting this:
export NCCL_P2P_LEVEL=NVL
On a 4x A100 80GB machine, I verified twice that this resolved the issue when it was present.
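The same workaround can be applied from a small launcher script instead of the shell. This is only a minimal sketch, assuming the job is started with the composer CLI as elsewhere in this thread; the helper names build_nccl_env and launch are hypothetical and not part of llm-foundry. According to the NCCL documentation, NCCL_P2P_LEVEL=NVL restricts peer-to-peer transfers to GPUs connected through NVLink.

```python
import os
import subprocess


def build_nccl_env(base_env=None, p2p_level="NVL"):
    """Return a copy of the environment with NCCL_P2P_LEVEL set.

    build_nccl_env is a hypothetical helper, not part of llm-foundry.
    The input mapping is copied, so the caller's environment is untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_LEVEL"] = p2p_level
    return env


def launch(cmd, env):
    """Run the training command with the prepared environment."""
    return subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    env = build_nccl_env()
    print("NCCL_P2P_LEVEL =", env["NCCL_P2P_LEVEL"])
    # Uncomment to launch; the YAML path is the one from the original report.
    # launch(["composer", "train.py", "yamls/finetune/mpt-7b_jokes.yaml"], env)
```

Setting the variable in the child environment keeps the change scoped to the training run rather than the whole shell session.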
I've also had to go down to a 2x A100 setup, because otherwise I run into the NCCL error. Nothing larger seems to work.
Were you on torch 1.13.1 or 2.0.1?
root@ce755e977208:/llm-foundry# pip show torch
Name: torch
Version: 1.13.1+cu117
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/lib/python3/dist-packages
Requires: typing-extensions
Required-by: composer, flash-attn, llm-foundry, mosaicml-streaming, pytorch-ranger, torch-optimizer, torchmetrics, torchtext, torchvision, triton-pre-mlir
I'm running the Docker image "mosaicml/pytorch:latest".
It shows these by default:
env|grep -i ncc
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
NCCL_VERSION=2.13.4-1
NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NV_LIBNCCL_PACKAGE_NAME=libnccl2
NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
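For anyone comparing environments across machines, the listing above can be reproduced from Python. This is a rough sketch that mirrors env | grep -i ncc using only the standard library; the function name nccl_related is made up for illustration.

```python
import os


def nccl_related(env=None):
    """Return entries whose NAME=value text contains 'ncc' (case-insensitive).

    This mirrors `env | grep -i ncc`, so a variable such as LD_LIBRARY_PATH
    is included when only its value mentions NCCL.
    """
    env = os.environ if env is None else env
    return {
        name: value
        for name, value in sorted(env.items())
        if "ncc" in f"{name}={value}".lower()
    }


if __name__ == "__main__":
    for name, value in nccl_related().items():
        print(f"{name}={value}")
```

Posting this output together with the torch version makes it easier to tell whether two reports come from the same NCCL build.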
I tried with two GPUs and didn't run into this issue, but with four GPUs I do, and then it works when using the env variable. I get an OOM error now, but that's probably unrelated :)
Are people still running into this problem?
I still have the same issue after setting export NCCL_P2P_LEVEL=NVL
If you haven't already, can you try working off one of the recommended images in the top-level README, making sure that your code is up-to-date with the main branch, and re-installing to get all the latest dependencies? Basically, I'm wondering if this happens after following the install/set-up instructions in the README.
NCCL errors are notoriously hard to diagnose, so it'd be helpful to see if this is just an environment issue. But, honestly, there's not a lot to go off of here, so I can't make any promises.
I updated my code with the main branch and reinstalled the whole environment. It's working now... thanks
Follow-up on this error: when I use a larger local dataset, I get the error again when saving the checkpoint. Do you have any ideas on that? thanks
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29544, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 601762 milliseconds before timing out.
.... ....
composer.core.engine: Post-closing callback RuntimeEstimator
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ERROR:composer.cli.launcher:Global rank 0 (PID 160339) exited with code -6
I am also getting the same error while saving the model.
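Follow-up on this error: when I use a larger local dataset, I get the error again when saving the checkpoint. Do you have any ideas on that? Thanks. RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=29544, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 601762 milliseconds before timing out. .... .... composer.core.engine: Post-closing callback RuntimeEstimator [E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down. ERROR:composer.cli.launcher:Global rank 0 (PID 160339) exited with code -6
I'm also getting the same error when saving the model.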
I'm sure the problem comes from saving the model on multiple GPUs, but I can't fix it. Have you fixed it yet?
We have fixed the issue with nccl timeouts at the end of runs while saving checkpoints. I have a feeling there may be a variety of issues in this thread at this point, so I am going to close this issue. Please open a new issue if you are still encountering problems.
Hi, I am trying to finetune the MPT-7B model using a local dataset on 2 A100 80GB GPUs. Below is the complete log.
Torch version: 1.13.1+cu117
I'd appreciate any help resolving the issue.
/mpt-7b/llm-foundry/scripts/train# composer train.py yamls/finetune/mpt-7b_jokes.yaml
Initializing model...
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:153: UserWarning: While attn_impl: triton can be faster than attn_impl: flash it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using attn_impl: flash if your model does not use alibi or prefix_lm.
  warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:08<00:00, 4.42s/it]
cfg.n_params=6.65e+09
Building train loader...
Using pad_token, but it is not set yet.
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e96cabec8ddb1637.arrow
Building eval loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-76316eeb9f4e44d9.arrow
Building trainer...
Logging config...
max_seq_len: 2048
global_seed: 17
run_name: mpt-7b-finetune
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  config_overrides:
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}
train_loader:
  name: finetuning
  dataset:
    hf_name: local_dataset
    split: train
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
eval_loader:
  name: finetuning
  dataset:
    hf_name: local_dataset
    split: test
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 50ba
  alpha_f: 0
optimizer:
  name: decoupled_adamw
  lr: 5.0e-06
  betas:
Config:
  node_name: unknown because NODENAME environment variable not set
  num_gpus_per_node: 2
  num_nodes: 1
  rank_zero_seed: 17
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 600963 milliseconds before timing out.
ERROR:composer.cli.launcher:Rank 1 crashed with exit code -6.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, Timeout(ms)=600000) ran for 600963 milliseconds before timing out.
Global rank 0 (PID 786) exited with code -6
Global rank 1 (PID 787) exited with code -6
----------Begin global rank 1 STDOUT----------
Initializing model...
cfg.n_params=6.65e+09
Building train loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building eval loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "local_dataset". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building trainer...
----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/14958374ab073ba1030c0caef4ae8380045bae45/attention.py:153: UserWarning: While attn_impl: triton can be faster than attn_impl: flash it uses more memory. When training larger models this can trigger alloc retries which hurts performance. If encountered, we recommend using attn_impl: flash if your model does not use alibi or prefix_lm.
  warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|██████████ | 1/2 [00:08<00:08, 8.54s/it]
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:10<00:00, 4.94s/it]
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:10<00:00, 5.48s/it]
Using pad_token, but it is not set yet.
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e96cabec8ddb1637.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/local_dataset-f2d32eae91a10861/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-76316eeb9f4e44d9.arrow
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 601403 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, Timeout(ms)=600000) ran for 601403 milliseconds before timing out.
----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 786) exited with code -6
/mpt-7b/llm-foundry/scripts/train#
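When reading a log like the one above, it can help to pull out the watchdog lines per rank. Here rank 0 timed out at SeqNum=2 on _ALLGATHER_BASE while rank 1 timed out at SeqNum=1 on BROADCAST, which may indicate the two ranks were not waiting on the same collective. The following is a minimal sketch using only the standard library; the regular expression is written against the message format shown in this thread, and the function name summarize_timeouts is hypothetical.

```python
import re

# Pattern for the watchdog message format seen in this thread's logs.
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Return {rank: details} for each watchdog timeout found in the log.

    If a rank reports the same timeout more than once, the last match wins.
    """
    per_rank = {}
    for match in WATCHDOG.finditer(log_text):
        per_rank[int(match["rank"])] = {
            "seq": int(match["seq"]),
            "op": match["op"],
            "timeout_ms": int(match["timeout_ms"]),
            "elapsed_ms": int(match["elapsed_ms"]),
        }
    return per_rank


if __name__ == "__main__":
    sample = (
        "[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective "
        "operation timeout: WorkNCCL(SeqNum=2, OpType=_ALLGATHER_BASE, "
        "Timeout(ms)=600000) ran for 600963 milliseconds before timing out.\n"
        "[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective "
        "operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, "
        "Timeout(ms)=600000) ran for 601403 milliseconds before timing out.\n"
    )
    for rank, info in sorted(summarize_timeouts(sample).items()):
        print(f"rank {rank}: {info}")
```

A low SeqNum points to a hang during startup, whereas the later report in this thread (SeqNum=29544) occurred well into the run, during checkpoint saving.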