Closed: Randl closed this issue 6 months ago.
Hi @Randl, thanks for asking. This is normal for the nccl backend. I invite you to read the description of the timeout arg in the related doc for more information.
Excerpt:
timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value equals 30 minutes. This is applicable for the gloo backend. For nccl, this is applicable only if the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1.
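(For context, here is a minimal sketch of what that excerpt means in plain torch.distributed terms. This is not from the thread, and it assumes the script is launched via torchrun so the rank/world-size env vars are already set:)

import os
from datetime import timedelta
import torch.distributed as dist

# Per the docs quoted above: without one of these env vars set to 1,
# nccl does not enforce the timeout argument at all.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # or NCCL_BLOCKING_WAIT=1

# Raise the timeout from the 30-minute default to 3 hours.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=3))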
@Randl can you rerun your code, building accelerate from pip install git+https://github.com/huggingface/accelerate@check-for-nccl, and verify we can catch this early? (And that this is indeed what is wrong with your setup?) 😄
@muellerzr NCCL_ASYNC_ERROR_HANDLING is set to 1 (by some of the libraries I use, I guess? I didn't set it). In fact, the function changed in this branch is called only twice in my code, both from training_args (https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L1871-L1873): once with self.backend=nccl and once with self.backend=None. So InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800)) can't even influence it?
I've also tried setting --ddp_timeout=10800 (this is what is passed from training_args) in my command, and it is passed to this function only in the second call; I still get the 30-minute timeout in my code.
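(Side note: assuming the standard transformers API, the same setting can be made programmatically; ddp_timeout is given in seconds and transformers converts it to a timedelta before handing it to torch.distributed:)

from transformers import TrainingArguments

# ddp_timeout is in seconds (default 1800); transformers wraps it in a
# timedelta when initializing the process group.
args = TrainingArguments(output_dir="out", ddp_timeout=10800)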
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I don't think it was addressed?
still not resolved?
...
Looking into this again this week, sorry for the delay
I'm definitely seeing an effect here. Note that the timeout only applies in situations where wait_for_everyone (or gather, etc.) has been called.
Minimal test:
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs
from torch import tensor

# Use a deliberately short 4-second process-group timeout
kwargs = [InitProcessGroupKwargs(timeout=timedelta(seconds=4))]
accelerator = Accelerator(kwargs_handlers=kwargs)

if accelerator.is_main_process:
    t = tensor(0).to(accelerator.device)
    time.sleep(8)  # main process overshoots the 4-second timeout
else:
    t = tensor(0).to(accelerator.device)

accelerator.wait_for_everyone()  # barrier where the timeout is enforced
print("All called!")
This will lead to a failure; change that 4 to a 10 and it'll pass.
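(To reproduce this, the snippet needs at least two processes so the barrier actually blocks; for example, save it as timeout_test.py, a hypothetical name, and run accelerate launch --num_processes 2 timeout_test.py.)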
Can you give us more of your trace? It doesn't hint at where it's failing.
I don't have access to the machine currently. I'll update you when I can run stuff on it. I don't think there was any additional information there. From the logs, it's failing after uploading the checkpoint to the hub, i.e. somewhere around https://github.com/huggingface/alignment-handbook/blob/ff618a4d13a2c77cf97479fac8af2c576619062a/scripts/run_sft.py#L203-L205
Thanks, that's helpful
I see the exact issue: it's due to SFTTrainer, and is not an accelerate issue (though it is accelerate adjacent). Can you open an issue in trl for this and ping me?
System Info

Information

Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)

Reproduction
and run the training
Note that the timeout is still 1800 seconds (see also https://github.com/huggingface/alignment-handbook/issues/59)

Expected behavior
Timeout is increased, and no crash.