huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Loading model OOMs with more GPUs #20320

Closed Dahoas closed 1 year ago

Dahoas commented 1 year ago

System Info

Who can help?

Hi all,

I am modifying an arbitrary HF text model for reinforcement learning reward modeling by appending a scalar output head and overriding the forward method. As part of this procedure I'd prefer to retain the flexibility of using any model without committing to a particular model class (e.g. GPT2). I have not found a way to inherit from the PreTrainedModel class while also retaining this flexibility, so the result is just an nn.Module subclass.

I find that when I try to torch.load a checkpoint to continue training a reward model fine-tuned with GPT-Neo-2.7B as the base, I OOM with >6 GPUs (A100s). This is counter-intuitive to me, as I would expect OOM issues in the opposite direction.

To train the reward model I am using HF's deepspeed integration. Tagging @stas00 as deepspeed integration point of contact.

Information

Tasks

Reproduction

import json
import os
from dataclasses import dataclass
from typing import Optional, Tuple

import pandas as pd
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn import Identity
from torch.utils.data import Dataset, random_split

import deepspeed
from transformers import (AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer,
                          GPT2LMHeadModel, GPT2Model, GPT2PreTrainedModel, GPT2Tokenizer,
                          IntervalStrategy, PreTrainedModel, Trainer, TrainingArguments)
from transformers.modeling_outputs import ModelOutput

class GPTRewardModel(nn.Module):
    def __init__(self, model_path):
        super().__init__()
        # Load the base causal LM and keep only its transformer backbone
        model = AutoModelForCausalLM.from_pretrained(model_path)
        self.config = model.config
        # gpt-neo configs expose hidden_size instead of n_embd
        self.config.n_embd = self.config.hidden_size if hasattr(self.config, "hidden_size") else self.config.n_embd
        self.transformer = model.transformer
        # Scalar value head mapping each hidden state to a reward
        self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        mc_token_ids=None,
        lm_labels=None,
        mc_labels=None,
        return_dict=False,
        output_attentions=False,
        output_hidden_states=False,
    ):
        transformer_outputs = self.transformer(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        hidden_states = transformer_outputs[0]

        # Project each final hidden state to a scalar per-token reward
        rewards = self.v_head(hidden_states).squeeze(-1)

        return rewards

model = GPTRewardModel("EleutherAI/gpt-neo-2.7B")
if torch.distributed.get_rank() == 0:
    torch.save(model.state_dict(), "model_fp16.pt")
model.load_state_dict(torch.load('model_fp16.pt'))
ds_config_gpt_2.json:

{
    "train_batch_size": 8,
    "fp16": {
      "enabled": "auto",
      "min_loss_scale": 1,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "initial_scale_power": 32
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
      "stage": 2,
      "offload_param": {
        "device": "none"
      },
      "offload_optimizer": {
        "device": "none"
      },
      "allgather_partitions": true,
      "allgather_bucket_size": 5e8,
      "contiguous_gradients": true
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": "auto",
        "betas": [
          0.9,
          0.999
        ],
        "eps": 1e-08
      }
    },
    "scheduler": {
      "type": "WarmupLR",
      "params": {
        "warmup_min_lr": 0,
        "warmup_max_lr": "auto",
        "warmup_num_steps": 100
      }
    }
  }

To launch, run: deepspeed --num_gpus=7 test_pretrained.py --deepspeed ds_config_gpt_2.json

Expected behavior

No OOM with more GPUs.

stas00 commented 1 year ago

It's a bit hard to follow your issue.

Is loading working when you use <= 6 GPUs?

I can't quite see from your example of the model itself how you actually run it - I suppose some modified version of an HF Trainer example program? Unless what you run is exactly what you shared here.

What you have shown doesn't use DeepSpeed; you're just using the deepspeed launcher, and the args are ignored since you're not parsing them. So this command simply runs the script you shared on each GPU separately - no DeepSpeed.
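
For example, the Trainer-based example scripts parse those flags with HfArgumentParser, roughly like this (a sketch, not your script; --output_dir would also need to be supplied on the command line):

from transformers import HfArgumentParser, TrainingArguments

# With the launch command above, the script receives --deepspeed ds_config_gpt_2.json
# plus a --local_rank flag added by the deepspeed launcher. Parsing them into
# TrainingArguments is what lets the HF Trainer enable its DeepSpeed integration.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

print(training_args.deepspeed)   # e.g. "ds_config_gpt_2.json"
print(training_args.local_rank)
# ...then build the model/datasets and pass training_args to Trainer(...)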

Also have a look at the size of the saved model file, to check whether it was saved in half precision or full precision - that's a 2x multiplier if you aren't doing it the way you intended.
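
A quick way to check, as a sketch (using the checkpoint filename from the reproduction above):

import os

import torch

ckpt = "model_fp16.pt"  # file saved in the reproduction above
print(f"file size: {os.path.getsize(ckpt) / 1e9:.2f} GB")

state_dict = torch.load(ckpt, map_location="cpu")
# fp16 weights report torch.float16; full precision reports torch.float32
print({tensor.dtype for tensor in state_dict.values()})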

stas00 commented 1 year ago

To use the HF Deepspeed integration you need to adapt one of the examples or write a new program following the examples as the guide. https://github.com/huggingface/transformers/tree/main/examples/pytorch

The integration is inside the HF Trainer, so once you switch to using the HF Trainer you will get the DS integration.

Dahoas commented 1 year ago

Ah, my apologies, this is confusing. My training script is below. I'm only using the HF Trainer.

import os
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM, IntervalStrategy, AutoModel, AutoConfig, PreTrainedModel
import json
from reward_model import GPTRewardModel
import deepspeed

class PairwiseTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # forward pass: the collator stacks chosen examples first, then rejected ones
        rewards = model(**inputs)
        rewards_chunked = rewards.view((2, -1))
        chosen_rewards = rewards_chunked[0]
        rejected_rewards = rewards_chunked[1]
        # pairwise ranking loss: chosen completions should score higher than rejected ones
        loss = -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
        return (loss, rewards) if return_outputs else loss

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer.pad_token = tokenizer.eos_token
training_args = TrainingArguments(output_dir='./results', num_train_epochs=4, logging_steps=100, save_strategy=IntervalStrategy.NO,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1, warmup_steps=100,
                                  weight_decay=0.01, logging_dir='./logs', fp16=True, bf16=False, learning_rate=5e-6, deepspeed='./ds_config_gpt_2.json')
# gptneo trained in jax

model = GPTRewardModel("EleutherAI/gpt-neo-2.7B")
load_checkpoint = True
if load_checkpoint:
    model.load_state_dict(torch.load('ckpts/single_context_pairwise/model_fp16.pt'))
#model.cuda()

data = []
dataset_name = "single_context_pairwise"
with open(dataset_name + ".jsonl", "r") as f:
    lines = f.readlines()
    for line in lines:
        loaded_line = json.loads(line)
        data.append(loaded_line)
        #data.append(loaded_line["prompt"] + loaded_line["response"])
print("Len data: ", len(data))

max_length = 1024
#max_length = max([max(len(tokenizer.encode(text["chosen"])), len(tokenizer.encode(text["rejected"]))) for text in data])
print("Max length: {}".format(max_length))

class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in pairs:
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer('<|startoftext|>' + chosen + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length", return_tensors="pt")
            rejected_encodings_dict = tokenizer('<|startoftext|>' + rejected + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length", return_tensors="pt")
            self.chosen_input_ids.append(chosen_encodings_dict['input_ids'])
            self.chosen_attn_masks.append(chosen_encodings_dict['attention_mask'])
            self.rejected_input_ids.append(rejected_encodings_dict['input_ids'])
            self.rejected_attn_masks.append(rejected_encodings_dict['attention_mask'])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return self.chosen_input_ids[idx], self.chosen_attn_masks[idx], self.rejected_input_ids[idx], self.rejected_attn_masks[idx]

def data_collator(data):
    return {'input_ids': torch.stack([f[0] for f in data] + [f[2] for f in data]),
            'attention_mask': torch.stack([f[1] for f in data] + [f[3] for f in data])}

dataset = PairwiseDataset(data, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
PairwiseTrainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=data_collator).train()

if torch.distributed.get_rank() == 0:
    print("SAVING MODEL")
    dir_path = os.path.join("ckpts", dataset_name)
    if not os.path.isdir(dir_path):
        os.mkdir(dir_path)
    torch.save(model.state_dict(), os.path.join(dir_path, "model_fp16_8.pt"))

Yes, loading works with <= 6 GPUs.

Good point about saving in the wrong precision. I will check.

stas00 commented 1 year ago

Much better.

Also, try first with a normal model of the same size. If that works just fine, then it would point to something introduced by your code.

If there is a problem with the normal model too, then it's a different story.

One other thing to consider is that if you resume from a saved DeepSpeed checkpoint, you can't change the topology on the fly, as it will try to resume using the same sharded layout the checkpoint was saved with. If you were to change the topology and try to resume from the existing DS checkpoint, it would normally fail.

So typically, when changing topology, you need to extract the non-sharded weights and then start anew from those instead of resuming. Here, since it appears you use ZeRO stage 2, that's trivial: it's just the saved weights file, as the weights were never sharded in the first place (they are under stage 3). So to test a topology change, I'd move your output_dir elsewhere and simply pass the weights file as the model_name_or_path.
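
(For completeness: under stage 3, where the weights are sharded, DeepSpeed ships helpers to consolidate a checkpoint back into a single fp32 state dict. A sketch, with an illustrative checkpoint directory:)

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Consolidates sharded ZeRO weights into one fp32 state_dict on CPU.
# "./results/checkpoint-500" is an illustrative DeepSpeed checkpoint directory.
state_dict = get_fp32_state_dict_from_zero_checkpoint("./results/checkpoint-500")
torch.save(state_dict, "consolidated_fp32.pt")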

I am concerned that what I wrote above is confusing; I'm just trying to guess what might be going wrong for you.

Dahoas commented 1 year ago

Update: Indeed I was saving and loading fp16 weights when I meant to be saving/loading fp32. (Although I still do not understand why loading fp16 in the manner I do throws an OOM error).
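
For reference, a minimal sketch of saving a full-precision copy instead (model is the GPTRewardModel instance from the training script above; the output path is illustrative):

import torch

# Upcast floating-point tensors to fp32 before saving; non-float buffers are left as-is.
# (This only upcasts whatever precision the module weights are currently in.)
fp32_state_dict = {
    name: tensor.float() if tensor.is_floating_point() else tensor
    for name, tensor in model.state_dict().items()
}
torch.save(fp32_state_dict, "ckpts/single_context_pairwise/model_fp32.pt")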

In any case thanks for your help!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.