Status: Closed (vwxyzjn closed 1 year ago)
This PR creates a variant where the reference model and reward model live on separate devices.
python lm_human_preference_details/train_policy_accelerate2.py \
    --rewards.trained_model '' \
    --base_model tiiuae/falcon-7b \
    --ppo.local_batch_size 1 \
    --ppo.no_whiten_rewards
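The PR text does not show how devices are chosen, so the following is only a rough sketch of one possible placement scheme, not the code in this PR. The function name `assign_devices`, the role names, and the wrap-around rule are all hypothetical. The device strings follow PyTorch's `cuda:N` naming.

```python
from typing import Dict


def assign_devices(num_gpus: int) -> Dict[str, str]:
    """Map each model role to a device string (hypothetical layout).

    Assumption: the policy stays on GPU 0, and the reference and reward
    models go to other GPUs when enough are available.
    """
    roles = ("policy", "ref_policy", "reward_model")
    if num_gpus < 1:
        # No GPU available: keep everything on the CPU.
        return {role: "cpu" for role in roles}
    # Role i goes to GPU i, wrapping around when there are fewer GPUs than roles.
    return {role: f"cuda:{i % num_gpus}" for i, role in enumerate(roles)}


if __name__ == "__main__":
    for n in (0, 1, 2, 3):
        print(n, assign_devices(n))
```

With three or more GPUs each model gets its own device. With fewer, some models share a device, so the memory saving from splitting the reference and reward models apart is only partial.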