I ran the 65B model on 8 × A100 (80 GB) GPUs. With my own edited prompt, it got stuck in an all-reduce and reported the following error:
RuntimeError: NCCL communicator was aborted on rank 5. Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18632, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807549 milliseconds before timing out.
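To put the numbers in that message in perspective, here is a minimal sketch that parses the watchdog line and converts the timings. `parse_watchdog_line` and its regular expression are my own hypothetical helper, written against this one message only; they are not part of example.py, PyTorch, or NCCL.

```python
import re

# The watchdog line copied from the error above.
LOG_LINE = (
    "[Rank 5] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=18632, OpType=ALLREDUCE, Timeout(ms)=1800000) "
    "ran for 1807549 milliseconds before timing out."
)

# Assumption: the pattern only covers the message format shown above.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the rank, sequence number, op type and timings, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    info = {k: (v if k == "op" else int(v)) for k, v in match.groupdict().items()}
    info["timeout_min"] = info["timeout_ms"] / 60000
    info["overrun_s"] = (info["elapsed_ms"] - info["timeout_ms"]) / 1000
    return info


info = parse_watchdog_line(LOG_LINE)
print(info)
```

The configured timeout works out to 30 minutes, and the operation was aborted about 7.5 seconds after that, so the all-reduce on rank 5 appears to have hung rather than merely run slowly.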
There was no such error when I ran example.py with the original prompts. It only occurred when I replaced them with the following prompt:
"Answer the following questions with `Yes` or `No`.
Question: There are `Samoa`, `Angola`, `Lebanon`, `Zambia` and `Cocos (Keeling) Islands` in column `English_short_name_lower_case`. Trere are `80`, `0`, `52`, `591` and `18` in column `Country`. Do the contents in column `English_short_name_lower_case` and column `Country` belong to the same category.
Answer: "
Does anyone else have this problem?