meta-llama / llama

Inference code for Llama models

Stuck when I run inference #194

Open BeachWang opened 1 year ago

BeachWang commented 1 year ago

I ran the 65B model on 8 × A100 (80 GB) GPUs. Inference got stuck in an all-reduce and eventually failed with the following error:

RuntimeError: NCCL communicator was aborted on rank 5. Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18632, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807549 milliseconds before timing out.

The error does not occur when I run example.py with the original prompts. It only occurs when I replace them with the following prompt:

"Answer the following questions with `Yes` or `No`. Question: There are `Samoa`, `Angola`, `Lebanon`, `Zambia` and `Cocos (Keeling) Islands` in column `English_short_name_lower_case`. Trere are `80`, `0`, `52`, `591` and `18` in column `Country`. Do the contents in column `English_short_name_lower_case` and column `Country` belong to the same category. Answer: "

Does anyone else have this problem?
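For anyone comparing logs across ranks, the fields in the watchdog message above can be pulled out with a few lines of Python. This is only a sketch for reading the log line: `parse_watchdog_timeout` and `WATCHDOG_RE` are hypothetical names, not part of PyTorch or the llama repo, and the pattern assumes the message has exactly the shape quoted in this report.

```python
import re

# Pattern for the NCCL watchdog timeout message as quoted in this issue.
# Assumption: other PyTorch versions may word the message differently.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(message):
    """Hypothetical helper: return the timeout fields, or None if absent."""
    m = WATCHDOG_RE.search(message)
    if m is None:
        return None
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        # Convert milliseconds to minutes for readability.
        "timeout_min": int(m["timeout_ms"]) / 60000,
        "elapsed_min": int(m["elapsed_ms"]) / 60000,
    }


ERROR = (
    "RuntimeError: NCCL communicator was aborted on rank 5. "
    "Original reason for failure was: [Rank 5] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=18632, OpType=ALLREDUCE, "
    "Timeout(ms)=1800000) ran for 1807549 milliseconds before timing out."
)

info = parse_watchdog_timeout(ERROR)
print(info)
```

Read this way, the message says that rank 5 waited in an all-reduce for the full 30-minute timeout before the watchdog aborted it. Running the same parse over each rank's log and comparing `seq_num` may help show whether all ranks reached the same collective, though that is a diagnostic idea and not a confirmed explanation of this hang.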

DhruvaBansal00 commented 1 year ago

I'm also seeing this error. Were you able to resolve it, @BeachWang?

Alex-HaochenLi commented 1 year ago

I'm running into the same problem. Could anyone explain the cause?

BeachWang commented 1 year ago

We have decided not to use the 65B model :(