Closed Dahoas closed 1 year ago
It's a bit hard to follow your issue.
Is loading working when you use <=6 GPUs?
I can't quite see from your example of the model itself how you run it. I suppose it's some modified version of the HF Trainer example program, unless what you run is exactly what you shared here?
What you have shown doesn't use DeepSpeed: you're just using the deepspeed launcher, and its args are ignored since you're not parsing them. So the launcher simply runs the script you have shown on each GPU separately, with no DeepSpeed involved.
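To illustrate what "parsing the args" means here, below is a minimal sketch using the standard library's argparse as a stand-in. In a real HF Trainer program this job is normally done by `TrainingArguments` (for example via `HfArgumentParser`), not by hand. The function name `parse_launcher_args` is hypothetical; `--deepspeed` is the flag from the launch command in this thread, and `--local_rank` is the argument the deepspeed launcher passes to each process.

```python
import argparse


def parse_launcher_args(argv):
    """Parse the arguments a script receives from the deepspeed launcher.

    Hypothetical helper for illustration only; a Trainer-based script would
    let TrainingArguments handle these fields instead.
    """
    parser = argparse.ArgumentParser(allow_abbrev=False)
    # Path to the DeepSpeed JSON config, as given on the command line.
    parser.add_argument("--deepspeed", type=str, default=None)
    # Rank of this process on the local node, injected by the launcher.
    parser.add_argument("--local_rank", type=int, default=-1)
    # Keep unrecognized arguments instead of failing on them.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# Example: the arguments one process might see (values are made up).
args, unknown = parse_launcher_args(
    ["--local_rank=3", "--deepspeed", "ds_config_gpt_2.json"]
)
print(args.local_rank, args.deepspeed, unknown)
```

If a script never reads these values, the config path given on the command line has no effect, which is the situation described above.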
Also have a look at the size of the saved model to check whether it was saved in half precision or full precision; getting this wrong can be a 2x multiplier on the size.
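As a rough check of the 2x multiplier, here is a back-of-the-envelope sketch. It assumes a nominal 2.7 billion parameters (taken from the model name, not measured) and counts only the weights, ignoring any file-format overhead. The helper name `checkpoint_size_gb` is made up for this example.

```python
# Bytes per parameter for common floating-point dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

# Nominal parameter count, assumed from the name "gpt-neo-2.7B".
NUM_PARAMS = 2_700_000_000


def checkpoint_size_gb(num_params, dtype):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp16", "fp32"):
    print(f"{dtype}: ~{checkpoint_size_gb(NUM_PARAMS, dtype):.1f} GB")
```

Comparing the size of the saved file on disk against these two estimates is a quick way to tell which precision the checkpoint was actually written in.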
To use the HF DeepSpeed integration, you need to adapt one of the examples or write a new program using the examples as a guide: https://github.com/huggingface/transformers/tree/main/examples/pytorch
The integration is inside the HF Trainer, so once you switch to using the HF Trainer you will get the DS integration.
Ah, my apologies, this is confusing. My training script is below. I'm only using the HF Trainer.
import os
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM, IntervalStrategy, AutoModel, AutoConfig, PreTrainedModel
import json
from reward_model import GPTRewardModel
import deepspeed
class PairwiseTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # forward pass: the collator stacks all chosen samples first, then all rejected ones
        rewards = model(**inputs)
        rewards_chunked = rewards.view((2, -1))
        chosen_rewards = rewards_chunked[0]
        rejected_rewards = rewards_chunked[1]
        # compute pairwise loss
        loss = -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
        # the rewards serve as the model outputs
        return (loss, rewards) if return_outputs else loss
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer.pad_token = tokenizer.eos_token
training_args = TrainingArguments(output_dir='./results', num_train_epochs=4, logging_steps=100, save_strategy=IntervalStrategy.NO,
per_device_train_batch_size=1, per_device_eval_batch_size=1, warmup_steps=100,
weight_decay=0.01, logging_dir='./logs', fp16=True, bf16=False, learning_rate=5e-6, deepspeed='./ds_config_gpt_2.json')
# gptneo trained in jaxh
model = GPTRewardModel("EleutherAI/gpt-neo-2.7B")
load_checkpoint = True
if load_checkpoint:
    model.load_state_dict(torch.load('ckpts/single_context_pairwise/model_fp16.pt'))
#model.cuda()
data = []
dataset_name = "single_context_pairwise"
with open(dataset_name + ".jsonl", "r") as f:
    lines = f.readlines()
    for line in lines:
        loaded_line = json.loads(line)
        data.append(loaded_line)
        #data.append(loaded_line["prompt"] + loaded_line["response"])
print("Len data: ", len(data))
max_length = 1024
#max_length = max([max(len(tokenizer.encode(text["chosen"])), len(tokenizer.encode(text["rejected"]))) for text in data])
print("Max length: {}".format(max_length))
class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in pairs:
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer('<|startoftext|>' + chosen + '<|endoftext|>', truncation=True,
                                              max_length=max_length, padding="max_length", return_tensors="pt")
            rejected_encodings_dict = tokenizer('<|startoftext|>' + rejected + '<|endoftext|>', truncation=True,
                                                max_length=max_length, padding="max_length", return_tensors="pt")
            self.chosen_input_ids.append(chosen_encodings_dict['input_ids'])
            self.chosen_attn_masks.append(chosen_encodings_dict['attention_mask'])
            self.rejected_input_ids.append(rejected_encodings_dict['input_ids'])
            self.rejected_attn_masks.append(rejected_encodings_dict['attention_mask'])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return self.chosen_input_ids[idx], self.chosen_attn_masks[idx], self.rejected_input_ids[idx], self.rejected_attn_masks[idx]

def data_collator(data):
    # stack all chosen samples first, then all rejected samples
    return {'input_ids': torch.stack([f[0] for f in data] + [f[2] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data] + [f[3] for f in data])}
dataset = PairwiseDataset(data, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
PairwiseTrainer(model=model, args=training_args, train_dataset=train_dataset,
                eval_dataset=val_dataset, data_collator=data_collator).train()
if torch.distributed.get_rank() == 0:
    print("SAVING MODEL")
    dir_path = os.path.join("ckpts", dataset_name)
    if not os.path.isdir(dir_path):
        os.mkdir(dir_path)
    torch.save(model.state_dict(), os.path.join(dir_path, "model_fp16_8.pt"))
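For readers following the pairwise loss in `compute_loss` in the script above, here is a small pure-Python sketch of the same quantity, the mean of -log(sigmoid(chosen - rejected)). It is not part of the script. It uses the identity -log(sigmoid(d)) = softplus(-d) so that large reward gaps do not overflow. The function name `pairwise_loss` and the sample rewards are made up for illustration.

```python
import math


def pairwise_loss(chosen, rejected):
    """Mean of -log(sigmoid(c - r)) over reward pairs.

    Computed as softplus(r - c) = max(x, 0) + log1p(exp(-|x|)) with x = r - c,
    which avoids overflow when the gap between rewards is large.
    """
    losses = []
    for c, r in zip(chosen, rejected):
        x = r - c
        losses.append(max(x, 0.0) + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)


# Made-up rewards: the loss shrinks as chosen rewards exceed rejected ones.
print(pairwise_loss([0.0], [0.0]))
print(pairwise_loss([2.0, 1.5], [0.0, 0.5]))
```

In PyTorch the same value can be obtained with `torch.nn.functional.logsigmoid` applied to the reward difference, which is also numerically stable.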
Yes, loading works with <= 6 GPUs.
Good point about saving in the wrong precision. I will check.
much better.
Also, try first with a normal model of the same size. If that works just fine, it would point to something being added by your code.
If there is a problem with a normal model too, then it's a different story.
One other thing to consider: if you resume from a saved DeepSpeed checkpoint, you can't change the topology on the fly, because it will try to resume using the same sharded layout the checkpoint was saved with. If you try to change the topology on an existing DS checkpoint, it will normally fail to resume.
So typically, to change topology you need to extract the non-sharded weights and then start anew from those instead of resuming. Here, since it appears you use ZeRO stage 2, it's trivial: it's just the saved weights file, as the weights were never sharded in the first place (they are under stage 3). So to test a topology change I'd move your output_dir elsewhere and simply pass the weights file as the model_name_or_path.
I am concerned that what I wrote above is confusing; I'm just trying to guess what might be going wrong for you.
Update: Indeed I was saving and loading fp16 weights when I meant to be saving/loading fp32. (Although I still do not understand why loading fp16 in the manner I do throws an OOM error).
In any case thanks for your help!
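On the open question of why loading the fp16 weights this way could run out of memory with more GPUs, one possible contributing factor (an assumption, not confirmed anywhere in this thread) is that every process started by the launcher executes the whole script, so each rank calls `torch.load` on the full checkpoint. By default `torch.load` restores tensors to the device they were saved from unless `map_location` is given, so copies from several ranks may end up in the same memory. The sketch below only does the arithmetic, again assuming a nominal 2.7 billion parameters; `total_load_gb` is a hypothetical helper.

```python
# Nominal parameter count, assumed from the name "gpt-neo-2.7B".
NUM_PARAMS = 2_700_000_000


def total_load_gb(num_ranks, num_params, bytes_per_param):
    """Memory in GB (10**9 bytes) if every rank holds its own full copy."""
    return num_ranks * num_params * bytes_per_param / 1e9


# fp16 weights (2 bytes per parameter), one full copy per launched process.
for ranks in (6, 7, 8):
    print(f"{ranks} ranks: ~{total_load_gb(ranks, NUM_PARAMS, 2):.1f} GB in total")
```

If this is what happens, passing `map_location="cpu"` to `torch.load` would keep the loaded copies off the GPU, although each rank would still hold a full copy in host memory.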
System Info
transformers version: 4.21.2
Who can help?
Hi all,
I am modifying an arbitrary HF text model for reinforcement learning reward modeling by appending a scalar output head and overriding the forward method. As part of this procedure I'd prefer to retain the flexibility of using any model without committing to a particular model class (e.g. GPT2). I have not found a way to inherit the PreTrainedModel class while also retaining this flexibility so the result is just a nn.Module class.
I find that when I try to torch.load a reward model fine-tuned from GPT-Neo 2.7B to continue training it, I get an OOM error when using >6 GPUs (A100). This is counter-intuitive to me, as I would expect OOM issues in the opposite direction.
To train the reward model I am using HF's deepspeed integration. Tagging @stas00 as deepspeed integration point of contact.
Reproduction
To launch, run:
deepspeed --num_gpus=7 test_pretrained.py --deepspeed ds_config_gpt_2.json
Expected behavior
No OOM with more gpus