Closed EQ3A2A closed 3 weeks ago
mark
Hi, did you solve this problem? Same problem here.
Hello @EQ3A2A @TongLiu-github can you please share an example that reproduces the error with a public dataset I can test?
Thanks for the reply. I solved this problem from: https://github.com/huggingface/accelerate/issues/314#issue-1201142707
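The linked accelerate issue discusses raising the NCCL process-group timeout so that long initialisation does not trip the distributed watchdog. As a rough sketch (the two-hour value is an assumption, not a recommendation from the thread), accelerate exposes this via `InitProcessGroupKwargs`:

```python
from datetime import timedelta


def make_accelerator(timeout_hours: int = 2):
    """Build an Accelerator with an extended NCCL timeout.

    The PyTorch default process-group timeout is around 30 minutes;
    a longer value avoids spurious NCCL timeouts during slow setup.
    """
    from accelerate import Accelerator, InitProcessGroupKwargs

    kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=timeout_hours))
    return Accelerator(kwargs_handlers=[kwargs])
```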
Thanks @TongLiu-github - do I understand correctly that you were experiencing an NCCL timeout error instead?
The reason you are seeing Stage 0 in the logs is that we initialise the reference model in this stage unless Stage 3 is set by the user: https://github.com/huggingface/trl/blob/2cad48d511fab99ac0c4b327195523a575afcad3/trl/trainer/dpo_trainer.py#L923
In the screenshot below, I compare DDP vs ZeRO-3, and one can indeed see that the latter uses less memory.
If that resolves the issue, feel free to close it.
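For readers wondering what that looks like in practice, here is a simplified sketch of the behaviour described above (the function name and structure are illustrative, not the actual TRL source at the linked line):

```python
# Hypothetical sketch of the reference-model DeepSpeed config handling:
# the frozen reference model reuses the user's config, but since it has
# no optimizer state or gradients to shard, ZeRO stages 1/2 bring no
# benefit and it is initialised with Stage 0 unless Stage 3 is requested.
def prepare_ref_model_config(ds_config: dict) -> dict:
    config = dict(ds_config)
    stage = config.get("zero_optimization", {}).get("stage", 0)
    if stage != 3:
        # Stage 3 partitions parameters themselves, which does help a
        # frozen model; anything below that is downgraded to Stage 0.
        config["zero_optimization"] = {"stage": 0}
    return config
```

This is why the logs show Stage 0 even when the training model itself runs under ZeRO-2.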
Hello @lewtun. So, if I understand correctly, this is just an issue with how the logs are displayed, and Zero2 is actually enabled, right?
Hi @Joe-Hall-Lee, yes, that's correct: the DeepSpeed logs are showing the initialisation of the reference model.
System Info
transformers version: 4.44.2

Information
Tasks: an example in the examples folder

Reproduction
The accelerate config file I'm using: deepspeed_config.yaml
The training script I'm using: train.py
Run the script with accelerate.

Expected behavior
ZeRO-2 should be enabled, but it is not working (the logs show it set to ZeRO Stage 0).
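Since the attached `deepspeed_config.yaml` is not shown in the thread, here is an example of what an accelerate DeepSpeed config enabling ZeRO-2 typically looks like (all values are assumptions, not the reporter's actual file):

```yaml
# Example accelerate config requesting ZeRO Stage 2 (illustrative values)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: none
  gradient_accumulation_steps: 1
mixed_precision: bf16
num_processes: 2
```

With such a config, the training model runs under ZeRO-2, while the Stage 0 lines in the logs refer only to the reference model, as explained above.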