Environment:
transformers: 4.39.0.dev0
trl: 0.7.10
torch: 2.2.2
GPUs: 8 x H100 (80GB)
DPO training hangs on a multi-GPU setup. The hang occurs when I launch training through the accelerate CLI with DeepSpeed's ZeRO-3 configuration.
Steps to Reproduce:
Clone the Alignment Handbook repository:
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
Install dependencies:
pip install wheel
python -m pip install .
Launch the training script with the specified configuration:
Expected vs. Actual Behavior:
Expected: Training runs across all 8 GPUs without stalling.
Actual: The process hangs immediately after printing this user warning:
UserWarning: You passed a model_id to the DPOTrainer. This will automatically create an `AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you.
After this warning, training makes no further progress.
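To narrow down where each rank is blocked, a stack dump from the hung processes might help. The following is a minimal sketch, not part of the Alignment Handbook scripts. It uses Python's standard-library `faulthandler` module; the helper names `arm_hang_dump` and `disarm_hang_dump` and the 300-second default are hypothetical choices for illustration. It assumes the training entry point can be edited to call `arm_hang_dump()` early on.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s: float = 300.0, stream=sys.stderr) -> None:
    """Write every thread's traceback to `stream` each `timeout_s` seconds.

    `stream` must be backed by a real file descriptor and stay open
    for as long as the watchdog is armed.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_dump() -> None:
    """Stop the periodic traceback dumps (for example, once training has started)."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_dump()` at the top of the training script on every rank should make each stuck process periodically print the Python frames it is waiting in, which may show whether the hang is in model loading, DeepSpeed initialization, or a collective call.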