huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

QLoRA + FSDP distributed fine-tuning failed at the end during the model saving stage #31675

Closed Neo9061 closed 1 month ago

Neo9061 commented 4 months ago

System Info

Following Philipp Schmid's blog post on FSDP + QLoRA in SageMaker.

Training completes, but the job fails at the very last step, model saving (this line), after the base model has already been loaded and merged with the adapter.
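For context, the step that fails is the final adapter merge and save; a minimal sketch of that pattern with peft (the model/adapter paths are illustrative, not the exact code from the blog post):

```python
import torch
from peft import AutoPeftModelForCausalLM

# Load the trained adapter together with its base model,
# merge the LoRA weights into the base weights, then save the result.
model = AutoPeftModelForCausalLM.from_pretrained(
    "/opt/ml/model/adapter",   # illustrative adapter path
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model", safe_serialization=True)
```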

The error is as follows.

Saving the newly created merged model to /opt/ml/model
[E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800900 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800889 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800903 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800906 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800889 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::runtime_errorstd::runtime_error'
'  what():    what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800889 milliseconds before timing out.[Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800882 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'  what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800907 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800900 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=783639, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800900 milliseconds before timing out.

I checked the memory usage and it does not appear to be an OOM. I wonder whether the model merging and saving step takes so long that it causes the process group to time out?
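If the merge/save on one rank runs longer than the default 30-minute NCCL timeout, the collective pending on the other ranks aborts, which would match the Timeout(ms)=1800000 in the log. One way to widen that window, assuming the HF Trainer is used (a sketch, not the original training script):

```python
from transformers import TrainingArguments

# ddp_timeout is forwarded to torch.distributed.init_process_group();
# the default is 1800 seconds (30 minutes), matching Timeout(ms)=1800000 above.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    ddp_timeout=7200,  # allow the final merge/save up to 2 hours
    # ... other arguments unchanged ...
)
```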

Who can help?

@ArthurZucker @philschmid @muellerzr @SunMarc

Information

Tasks

Reproduction

See the description above.

Expected behavior

The training job finishes and saves the merged model without errors.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ruian1 commented 3 months ago

I just hit the same error; it happened when saving weights at the end of an epoch. Any suggestions?

ArthurZucker commented 3 months ago

Hey! Could you give a bit more detail about the transformers / accelerate versions you are using, etc.? `transformers-cli env` should output that.

ruian1 commented 2 months ago

@ArthurZucker thanks for taking a look! I was trying to fine-tune Idefics2. I tried every transformers version that supports Idefics2 and none of them worked; the training loss looks fine. I found the discussions here, but both (1) starting new training runs and (2) setting the timeout value to a much larger number failed (for (2), I let it run for ~12 hours and it still did not reach epoch 2).
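For reference, a typical way to raise that timeout in an Accelerate-based script looks like this (an illustrative sketch of the kind of change described above, not my exact code):

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout well above the 30-minute default
# so the other ranks do not abort while one rank is busy saving weights.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=12))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```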

- `transformers` version: 4.44.0.dev0
- Platform: Linux-5.10.0-31-cloud-amd64-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.24.2
- Safetensors version: 0.4.3
- Accelerate version: 0.33.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A100-SXM4-80GB

ArthurZucker commented 2 months ago

I think this was fixed in the 4.44.1 patch!
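A quick way to check that the installed version already contains that patch (a sketch; `packaging` is a dependency of transformers):

```python
from packaging import version
import transformers

# The fix is reported to be in the 4.44.1 patch release.
assert version.parse(transformers.__version__) >= version.parse("4.44.1"), transformers.__version__
```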

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.