Confirmed that it can reasonably run 7B models (no benchmark results yet).
# Train the reward model first, then train the PPO policy against it; checkpoints go under a timestamped directory.
SAVE_PATH_REWARD="models/train_7b_$(date +%s)/reward.pt"
SAVE_PATH_POLICY="models/train_7b_$(date +%s)/policy.pt"
poetry run accelerate launch --config_file deepspeed.yaml lm_human_preference_details/train_reward_accelerate.py \
--base_model cerebras/Cerebras-GPT-6.7B \
--no_use_tensorflow_adam \
--gradient_accumulation_steps=4 \
--local_rollout_batch_size=4 \
--save_path=$SAVE_PATH_REWARD \
--track && \
poetry run accelerate launch --config_file deepspeed.yaml lm_human_preference_details/train_policy_accelerate.py \
--rewards.trained_model=$SAVE_PATH_REWARD \
--base_model=cerebras/Cerebras-GPT-6.7B \
--deepspeed \
--no_use_tensorflow_adam \
--ppo.gradient_accumulation_steps 64 \
--track
https://wandb.ai/costa-huang/cleanRL/runs/hn9wtka9?workspace=user-costa-huang
This PR brings DeepSpeed integration so that tuning 7B models becomes feasible. In the summarize-from-human-feedback paper, they experimented with 1.3B, 2.7B, and 6.7B models, so this PR would in principle allow us to replicate that work.
Some of the notable changes needed to make things work:
- `mixed_precision: 'bf16'` (in `deepspeed.yaml`) turns out to be important, otherwise OOM.
- `accelerator.prepare` and `deepspeed.initialize`, otherwise OOM (see the sketch after this list).
- `bf16` for `reward_model` and `ref_policy`, otherwise OOM.
- `critic_model`: they finally offload `reward_model`, `critic_model`, and `ref_policy` to CPU, but it is not necessary in our case.
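To make the items above concrete, here is a minimal sketch of the pattern they describe; it is not the exact code in this PR, and the model classes (the real reward model has a scalar head), batch size, and learning rate are placeholders. The trained policy goes through `accelerator.prepare` (which picks up the DeepSpeed config from `deepspeed.yaml`, including bf16 mixed precision), while the frozen `reward_model` and `ref_policy` are loaded in bf16 and wrapped with `deepspeed.initialize` using an inference-only config:

```python
# Illustrative sketch only -- not the exact PR code.
import deepspeed
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # launched via `accelerate launch --config_file deepspeed.yaml`
base_model = "cerebras/Cerebras-GPT-6.7B"

# The policy is the only model that trains, so it goes through accelerator.prepare.
policy = AutoModelForCausalLM.from_pretrained(base_model)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)  # placeholder lr
policy, optimizer = accelerator.prepare(policy, optimizer)

# reward_model and ref_policy are frozen; loading them in bf16 (instead of fp32)
# is one of the things that avoids OOM. A plain causal LM stands in for the
# scalar-head reward model here to keep the sketch short.
reward_model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
ref_policy = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Wrap the frozen models with deepspeed.initialize using an inference-only config,
# rather than passing them through accelerator.prepare.
eval_ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "bf16": {"enabled": True},
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
}
reward_model, *_ = deepspeed.initialize(model=reward_model, config=eval_ds_config)
reward_model.eval()
ref_policy, *_ = deepspeed.initialize(model=ref_policy, config=eval_ds_config)
ref_policy.eval()
```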
Here is a training run: https://wandb.ai/costa-huang/cleanRL/runs/kve7tu43/overview

Training results were pretty bad, but I think this is probably an issue related to model compatibility. To replicate the summarize-from-human-feedback paper, we should probably use the OPT models, which come in 1.3B, 2.7B, and 6.7B sizes.
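If we go that route, the change on our side is essentially just the base model id; a quick sanity check of the candidate checkpoints could look like this (a hypothetical snippet; the model ids are the public Hugging Face OPT checkpoints):

```python
# Hypothetical sanity check: confirm the OPT checkpoints match the paper's model sizes.
from transformers import AutoModelForCausalLM

for model_id in ["facebook/opt-1.3b", "facebook/opt-2.7b", "facebook/opt-6.7b"]:
    model = AutoModelForCausalLM.from_pretrained(model_id)  # downloads the full checkpoint
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{model_id}: {n_params / 1e9:.1f}B parameters")
```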
CC @lewtun