swang99 opened this issue 1 year ago
Hi @swang99,
Did you also set deepspeed_enable to True in the config.yaml? That has to be done as well. If so, can you please share the config.yaml and the DeepSpeed config you used?
Hi @EikeKohl. Yes, here is my config.yaml:
---
trainer_config:
  # learning rates
  actor_lr: 0.000005
  critic_lr: 0.000009
  # PPO hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.02
  # coefficient for the discounted rewards
  gamma_discounted: 1
  # path to the examples to be sampled (training dataset), see rlhf_dataset.json
  examples_path: "./datasets/rlhf_training_data.json"
  # number of episodes and generations performed for each episode
  # in the train() method
  num_episodes: 100
  max_timesteps: 32
  # number of timesteps after which the learn() method is called
  # (to update the weights)
  update_timesteps: 32
  # number of examples sampled at each timestep
  num_examples: 1
  # batch size and epochs for the training
  batch_size: 1
  epochs: 1
  # number of episodes after which the checkpoints are updated in RL training
  checkpoint_steps: 1000
  # name of the actor_rl checkpoint from which to resume
  # during actor RL training; if null, the last one is loaded
  checkpoint_name: null
actor_config:
  model: "facebook/opt-1.3b"
  model_folder: "./models"
  tokenizer_path: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  # freeze the model embeddings during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  # (only for llama)
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion);
  # it depends on the model used
  max_sequence_length: 2048
  # max tokens generated by the actor (completion only)
  max_tokens: 2048
  # minimum number of tokens generated by the actor
  min_tokens: 100
  # additional prompt tokens to be used for templates or as a safety margin
  additonal_prompt_tokens: 20
  # temperature for the actor
  temperature: 0.1
  batch_size: 2
  # number of iterations between prints
  iteration_per_print: 1
  lr: 0.000009
  epochs: 1
  # number of backpropagation steps between checkpoint saves
  checkpoint_steps: 5000
  # number of checkpoints to keep while removing older ones
  # (keeps memory consumption of checkpoints reasonable)
  n_checkpoints_to_keep: 5
  # name of the actor checkpoint from which to resume
  # during actor training; if null, the last one is loaded
  checkpoint_name: null
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False
  # use_peft - the PEFT parameters can be modified in peft_config.yaml
  peft_enable: False
  peft_config_path: "./artifacts/config/peft_config.yaml"
reward_config:
  # models to choose from are gpt2-large, bart-base, longformer-base-4096;
  # more can simply be added in the reward.py __init__()
  model: "facebook/opt-125m"
  model_folder: "./models"
  # hidden size of the additional ffw head that produces the scores
  model_head_hidden_size: 2048
  max_sequence_length: 2048
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 8
  epochs: 1
  iteration_per_print: 1
  # steps after which the checkpoints are saved
  checkpoint_steps: 10000
  # name of the reward checkpoint from which to resume
  # during reward training; if null, the last one is loaded
  checkpoint_name: null
  lr: 0.000009
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False
critic_config:
  # models to choose from are gpt2-large, bart-base, longformer-base-4096;
  # more can simply be added in the reward.py __init__()
  model: "facebook/opt-125m"
  # hidden size of the additional ffw head that produces the scores
  model_head_hidden_size: 2048
  max_sequence_length: 2048
  model_folder: "./models"
  # name of the critic checkpoint from which to resume
  # during critic training; if null, the last one is loaded
  checkpoint_name: null
And the ds_config.json:
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": false,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "sub_group_size": 1e12,
    "elastic_checkpoint": true,
    "stage3_gather_16bit_weights_on_model_save": true,
    "ignore_unused_parameters": true,
    "round_robin_gradients": true
  }
}
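As a general DeepSpeed note (not specific to this thread): DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so a fixed value of 8 only adds up for a matching world size. A minimal sketch of the batch-related fields for a hypothetical 2-GPU run (1 × 4 × 2 = 8) would be:

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "train_batch_size": 8
}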
Hi @swang99, thanks for opening the issue. I think I found the problem: if you look at PR #306, there is an updated config.yaml with added fields for distributed training with RL. It seems that yours is not updated; could you check that? Thanks again!
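Judging from the actor_config and reward_config sections in the config posted above, the fields the updated config.yaml adds to trainer_config are presumably the same distributed-training switches. A rough, unverified sketch (check PR #306 for the exact field names):

trainer_config:
  # ...existing trainer_config fields from above...
  # deepspeed settings (illustrative; exact fields are in PR #306)
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False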
@swang99 was your issue resolved with the latest repo state? If not, could you please share your nvidia-smi output? You could also try ZeRO optimization stage 3 (https://www.deepspeed.ai/tutorials/zero/#zero-overview).
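For the stage 3 suggestion, the ds_config.json posted above already contains offload_param and the stage3_* entries, which DeepSpeed only honors when stage 3 is active, so the minimal change is the stage value itself (fragment shown in isolation, assuming the rest of the file stays as posted):

{
  "zero_optimization": {
    "stage": 3
  }
}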
I've been having issues trying to distribute training onto multiple GPUs. Even after following this pull request, https://github.com/nebuly-ai/nebullvm/pull/316, nvidia-smi still shows all of the load on one GPU, whether I use DeepSpeed or Accelerate.
Here are the commands I tried to train the actor model:
deepspeed artifacts/main.py artifacts/config/config.yaml --type RL
accelerate launch artifacts/main.py artifacts/config/config.yaml --type RL
Everything else I kept at the default configuration and did not touch. Any ideas? Thank you.
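For reference, both launchers accept explicit GPU-count flags (standard deepspeed and accelerate CLI options, shown here assuming two GPUs), which at least makes it obvious whether more than one process is being spawned:

deepspeed --num_gpus=2 artifacts/main.py artifacts/config/config.yaml --type RL
accelerate launch --multi_gpu --num_processes 2 artifacts/main.py artifacts/config/config.yaml --type RL

accelerate can also pick these settings up from accelerate config; passing the flags explicitly is just the quickest way to rule out a single-process launch.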