Confirmed that it can reasonably run 7B models (no benchmark results yet).
# Train the reward model first, then train the PPO policy against it; checkpoints go under a timestamped directory.
SAVE_PATH_REWARD="models/train_7b_$(date +%s)/reward.pt"
SAVE_PATH_POLICY="models/train_7b_$(date +%s)/policy.pt"
poetry run accelerate launch --config_file deepspeed.yaml lm_human_preference_details/train_reward_accelerate.py \
--base_model cerebras/Cerebras-GPT-6.7B \
--no_use_tensorflow_adam \
--gradient_accumulation_steps=4 \
--local_rollout_batch_size=4 \
--save_path=$SAVE_PATH_REWARD \
--track && \
poetry run accelerate launch --config_file deepspeed.yaml lm_human_preference_details/train_policy_accelerate.py \
--rewards.trained_model=$SAVE_PATH_REWARD \
--base_model=cerebras/Cerebras-GPT-6.7B \
--deepspeed \
--no_use_tensorflow_adam \
--ppo.gradient_accumulation_steps 64 \
--track
https://wandb.ai/costa-huang/cleanRL/runs/hn9wtka9?workspace=user-costa-huang
This PR brings DeepSpeed integration so that tuning 7B models becomes feasible. In the summarize-from-human-feedback paper, they experimented with 1.3B, 2.7B, and 6.7B models, so this PR would in principle allow us to replicate that work.
Some of the notable changes needed to make things work:
- `mixed_precision: 'bf16'` (in `deepspeed.yaml`) turns out to be important, otherwise OOM.
- `accelerator.prepare` and `deepspeed.initialize`, otherwise OOM (see the sketch after this list).
- `bf16` for `reward_model` and `ref_policy`, otherwise OOM.
- `critic_model`: they finally offload `reward_model`, `critic_model`, and `ref_policy` to CPU, but it is not necessary in our case.
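To make the items above concrete, here is a minimal sketch of the pattern they describe; it is not the exact code in this PR, and the model classes (the real reward model has a scalar head), batch size, and learning rate are placeholders. The trained policy goes through `accelerator.prepare` (which picks up the DeepSpeed config from `deepspeed.yaml`, including bf16 mixed precision), while the frozen `reward_model` and `ref_policy` are loaded in bf16 and wrapped with `deepspeed.initialize` using an inference-only config:

```python
# Illustrative sketch only -- not the exact PR code.
import deepspeed
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # launched via `accelerate launch --config_file deepspeed.yaml`
base_model = "cerebras/Cerebras-GPT-6.7B"

# The policy is the only model that trains, so it goes through accelerator.prepare.
policy = AutoModelForCausalLM.from_pretrained(base_model)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)  # placeholder lr
policy, optimizer = accelerator.prepare(policy, optimizer)

# reward_model and ref_policy are frozen; loading them in bf16 (instead of fp32)
# is one of the things that avoids OOM. A plain causal LM stands in for the
# scalar-head reward model here to keep the sketch short.
reward_model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
ref_policy = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Wrap the frozen models with deepspeed.initialize using an inference-only config,
# rather than passing them through accelerator.prepare.
eval_ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "bf16": {"enabled": True},
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
}
reward_model, *_ = deepspeed.initialize(model=reward_model, config=eval_ds_config)
reward_model.eval()
ref_policy, *_ = deepspeed.initialize(model=ref_policy, config=eval_ds_config)
ref_policy.eval()
```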
Here is a training run: https://wandb.ai/costa-huang/cleanRL/runs/kve7tu43/overview

Training results were pretty bad, but I think this is probably an issue related to model compatibility. To replicate the summarize-from-human-feedback paper, we should probably use the OPT models, which come in 1.3B, 2.7B, and 6.7B sizes.
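If we go that route, the change on our side is essentially just the base model id; a quick sanity check of the candidate checkpoints could look like this (a hypothetical snippet; the model ids are the public Hugging Face OPT checkpoints):

```python
# Hypothetical sanity check: confirm the OPT checkpoints match the paper's model sizes.
from transformers import AutoModelForCausalLM

for model_id in ["facebook/opt-1.3b", "facebook/opt-2.7b", "facebook/opt-6.7b"]:
    model = AutoModelForCausalLM.from_pretrained(model_id)  # downloads the full checkpoint
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{model_id}: {n_params / 1e9:.1f}B parameters")
```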
CC @lewtun