vwxyzjn / lm-human-preference-details

RLHF implementation details of OAI's 2019 codebase
MIT License

2nd device (DO NOT MERGE) #15

Closed vwxyzjn closed 1 year ago

vwxyzjn commented 1 year ago

This PR creates a variant where the reference model and reward model live on separate devices.

```shell
python lm_human_preference_details/train_policy_accelerate2.py \
    --rewards.trained_model '' \
    --base_model tiiuae/falcon-7b \
    --ppo.local_batch_size 1 \
    --ppo.no_whiten_rewards
```
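The core idea can be sketched in a few lines of PyTorch: keep the policy on one device, place the frozen reference and reward models on a second device, and move activations across devices explicitly before each forward pass. The `make_model` stand-in below is hypothetical (tiny linear modules instead of the actual falcon-7b language models), and this is only an illustration of the placement pattern, not the PR's training code.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real transformer models used in the PR.
def make_model() -> nn.Module:
    return nn.Linear(8, 8)

# Put the policy on the first device and the reference/reward models on a
# second device; fall back to CPU when two GPUs are not available.
if torch.cuda.device_count() >= 2:
    policy_device = torch.device("cuda:0")
    aux_device = torch.device("cuda:1")
else:
    policy_device = aux_device = torch.device("cpu")

policy = make_model().to(policy_device)
ref_model = make_model().to(aux_device)     # reference model on the 2nd device
reward_model = make_model().to(aux_device)  # reward model on the 2nd device

x = torch.randn(1, 8, device=policy_device)
hidden = policy(x)

# Tensors must be moved to the device a model lives on before calling it.
with torch.no_grad():
    ref_out = ref_model(hidden.to(aux_device))
    reward = reward_model(hidden.to(aux_device))
```

Splitting the frozen models onto a second device frees memory on the policy device, at the cost of the explicit `.to(aux_device)` transfers at each step.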