huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to disable model parallelism and enable data parallelism when using Accelerate and `device_map='auto'`? #21736

Closed chenmingjiong closed 1 year ago

chenmingjiong commented 1 year ago

System Info

Who can help?

@pacman100

Information

Tasks

Reproduction

I got this error when fine-tuning "EleutherAI/gpt-j-6B" with LoRA on 8×2080ti: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Steps to reproduce: clone https://github.com/CarperAI/trlx and modify the script examples/summarize_rlhf/sft/train_gptj_summarize.py as follows:

import random
import os
import evaluate
import numpy as np
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model 
from summarize_dataset import TLDRDataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

def set_seed(seed_val=42):
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

if __name__ == "__main__":
    output_dir = "gptj-supervised-summarize-checkpoint"
    train_batch_size = 4
    gradient_accumulation_steps = 1
    learning_rate = 1e-5
    eval_batch_size = 1
    eval_steps = 500
    max_input_length = 550
    save_steps = 1000
    num_train_epochs = 5
    random.seed(42)
    os.environ["WANDB_DISABLED"] = "true"

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    # load_in_8bit requires a device_map; 'auto' shards the model across all visible GPUs
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", use_cache=False, load_in_8bit=True, device_map='auto')
    tokenizer.pad_token = tokenizer.eos_token
    model.resize_token_embeddings(len(tokenizer))
    tokenizer.pad_token_id = tokenizer.eos_token_id
    model.config.end_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = model.config.eos_token_id

    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

    class CastOutputToFloat(nn.Sequential):
        def forward(self, x): return super().forward(x).to(torch.float32)
    model.lm_head = CastOutputToFloat(model.lm_head)

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)

    # Set up the datasets
    data_path = "CarperAI/openai_summarize_tldr"
    train_dataset = TLDRDataset(
        data_path,
        tokenizer,
        "train",
        max_length=max_input_length,
    )
    dev_dataset = TLDRDataset(
        data_path,
        tokenizer,
        "valid",
        max_length=max_input_length,
    )

    # Set up the metric
    rouge = evaluate.load("rouge")

    def compute_metrics(eval_preds):
        labels_ids = eval_preds.label_ids
        pred_ids = eval_preds.predictions
        pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
        label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
        result = rouge.compute(predictions=pred_str, references=label_str)
        return result

    # Create a preprocessing function to extract out the proper logits from the model output
    def preprocess_logits_for_metrics(logits, labels):
        if isinstance(logits, tuple):
            logits = logits[0]
        return logits.argmax(dim=-1)

    # Prepare the trainer and start training
    training_args = TrainingArguments(
        output_dir=output_dir,
        evaluation_strategy="steps",
        eval_accumulation_steps=1,
        learning_rate=learning_rate,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
        gradient_checkpointing=True,
        half_precision_backend="auto",
        fp16=True,
        adam_beta1=0.9,
        adam_beta2=0.95,
        gradient_accumulation_steps=gradient_accumulation_steps,
        num_train_epochs=num_train_epochs,
        warmup_steps=100,
        eval_steps=eval_steps,
        save_steps=save_steps,
        load_best_model_at_end=True,
        logging_steps=50,
        # deepspeed="examples/summarize_rlhf/sft/ds_config_gptj.json",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        compute_metrics=compute_metrics,
        data_collator=default_data_collator,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    )
    trainer.train()
    trainer.save_model(output_dir)

Then run: accelerate launch --num_processes 8 examples/summarize_rlhf/sft/train_gptj_summarize.py

Full error logs:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/trlx/examples/summarize_rlhf/sft/train_gptj_summarize_lora_acc.py:154 in <module>          │
│                                                                                                  │
│   151 │   │   data_collator=default_data_collator,                                               │
│   152 │   │   preprocess_logits_for_metrics=preprocess_logits_for_metrics,                       │
│   153 │   )                                                                                      │
│ ❱ 154 │   trainer.train()                                                                        │
│   155 │   trainer.save_model(output_dir)                                                         │
│   156                                                                                            │
│                                                                                                  │
│ /data/transformers/src/transformers/trainer.py:1631 in train                                     │
│                                                                                                  │
│   1628 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1629 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1630 │   │   )                                                                                 │
│ ❱ 1631 │   │   return inner_training_loop(                                                       │
│   1632 │   │   │   args=args,                                                                    │
│   1633 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1634 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /data/transformers/src/transformers/trainer.py:1898 in _inner_training_loop                      │
│                                                                                                  │
│   1895 │   │   │   │   │   with model.no_sync():                                                 │
│   1896 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1897 │   │   │   │   else:                                                                     │
│ ❱ 1898 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1899 │   │   │   │                                                                             │
│   1900 │   │   │   │   if (                                                                      │
│   1901 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /data/transformers/src/transformers/trainer.py:2643 in training_step                             │
│                                                                                                  │
│   2640 │   │   │   return loss_mb.reduce_mean().detach().to(self.args.device)                    │
│   2641 │   │                                                                                     │
│   2642 │   │   with self.compute_loss_context_manager():                                         │
│ ❱ 2643 │   │   │   loss = self.compute_loss(model, inputs)                                       │
│   2644 │   │                                                                                     │
│   2645 │   │   if self.args.n_gpu > 1:                                                           │
│   2646 │   │   │   loss = loss.mean()  # mean() to average on multi-gpu parallel training        │
│                                                                                                  │
│ /data/transformers/src/transformers/trainer.py:2675 in compute_loss                              │
│                                                                                                  │
│   2672 │   │   │   labels = inputs.pop("labels")                                                 │
│   2673 │   │   else:                                                                             │
│   2674 │   │   │   labels = None                                                                 │
│ ❱ 2675 │   │   outputs = model(**inputs)                                                         │
│   2676 │   │   # Save past state if it exists                                                    │
│   2677 │   │   # TODO: this needs to be fixed and made cleaner later.                            │
│   2678 │   │   if self.args.past_index >= 0:                                                     │
│                                                                                                  │
│ /home/chenmingrui/miniconda3/envs/petals/lib/python3.10/site-packages/torch/nn/modules/module.py │
│ :1194 in _call_impl                                                                              │
│                                                                                                  │
│   1191 │   │   # this function, and just call forward.                                           │
│   1192 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1193 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1194 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1195 │   │   # Do not call functions when jit is used                                          │
│   1196 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1197 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /home/chenmingrui/miniconda3/envs/petals/lib/python3.10/site-packages/torch/nn/parallel/data_par │
│ allel.py:157 in forward                                                                          │
│                                                                                                  │
│   154 │   │   │                                                                                  │
│   155 │   │   │   for t in chain(self.module.parameters(), self.module.buffers()):               │
│   156 │   │   │   │   if t.device != self.src_device_obj:                                        │
│ ❱ 157 │   │   │   │   │   raise RuntimeError("module must have its parameters and buffers "      │
│   158 │   │   │   │   │   │   │   │   │      "on device {} (device_ids[0]) but found one of "    │
│   159 │   │   │   │   │   │   │   │   │      "them on device: {}".format(self.src_device_obj,    │
│   160                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Expected behavior

I'm using 8×2080ti. When training on a single 2080ti with python examples/summarize_rlhf/sft/train_gptj_summarize.py, the code above runs normally, so the model and data fit on one GPU. I therefore want data parallelism (like DDP), not model parallelism. The load_in_8bit option in .from_pretrained() requires setting the device_map option, and with device_map='auto' the model is spread across several GPUs, as in naive model parallelism, which causes RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1 during training. Setting device_map correctly should probably solve this, but I can't find how to do that in the documentation.

younesbelkada commented 1 year ago

Hello @chenmingjiongjiong, what is the VRAM of your GPU? Can you alternatively try device_map={'':torch.cuda.current_device()}?

chenmingjiong commented 1 year ago

can you alternatively try device_map={'':torch.cuda.current_device()}

This solved my problem. Thanks!

Then I got another error related to bitsandbytes; I have submitted an issue in their repo.

beyondguo commented 1 year ago

Wow, this is interesting! Could you explain why this trick works?

younesbelkada commented 1 year ago

Sure @beyondguo, per my understanding it is quite simple. device_map={"":0} means "try to fit the entire model on device 0", which in this case is GPU 0. In a distributed setting, torch.cuda.current_device() returns the device the current process is working on. If you have 4 GPUs and run DDP with 4 processes, each process works on its own GPU, so if each process loads the model with device_map={"":i}, process i will try to fit the entire model on GPU i. This gives you n working processes, each holding a replica of the model. I remember having some issues with torch.cuda.current_device(), so I now advise users to use accelerate instead and retrieve the current process index with the following trick:

from accelerate import Accelerator

dummy_accelerator = Accelerator()
current_device = dummy_accelerator.process_index
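
For example, plugging this into from_pretrained would look roughly like the sketch below (the checkpoint and load_in_8bit come from the script in the original post; adapt them to your setup):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# each DDP process loads its own full replica onto its own GPU
current_device = Accelerator().process_index

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",            # checkpoint from the original post
    load_in_8bit=True,                # as in the original script; needs bitsandbytes
    device_map={"": current_device},  # fit the whole model on this process's device
)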

Let me know if anything is unclear

beyondguo commented 1 year ago

Thanks @younesbelkada. Now I'm using LoRA to tune an LLM (ChatGLM-6B) on 2 × A800 80G GPUs. I've got some findings that really confuse me.

The first problem:

However, if I set both device_map="auto" and model.is_parallelizable=False, model parallelization is still activated. I think model.is_parallelizable=False should block model parallelization.

Second problem:

However, I found the latter method consumes nearly the same GPU memory per GPU as the first method. Why? I thought it should only consume half the memory per GPU compared with the first method.


One more thing: with device_map="auto" the batch size is halved compared with device_map={'':torch.cuda.current_device()}, and yet it is about 1.5× faster. Could you please explain why this happens? Many thanks!

younesbelkada commented 1 year ago

Hi @beyondguo, thanks for looping back.

1. Yes, setting device_map="auto" means you opt in to model parallelism: the model's layers are spread across different GPUs, and only one GPU is active at a time.
2. I think in the latest versions of transformers this argument (is_parallelizable) is not needed anymore.

Regarding the second problem, I think this is expected: if things are set up correctly and you have a copy of the model on each of the 2 GPUs, you will also have 2 copies of the optimizer states, and the input data will be split across both processes.

beyondguo commented 1 year ago

Thanks for your detailed reply! @younesbelkada

To my understanding, with device_map="auto" only a subset of the layers is allocated to each GPU, which should lead to lower per-GPU memory consumption. However, it consumes nearly the same GPU memory as setting device_map={'':torch.cuda.current_device()}.

younesbelkada commented 1 year ago

I see, thanks for your reply! Can you provide more details (how many GB are allocated, which model, etc.)? Thanks!

beyondguo commented 1 year ago

Sure.

Model: ChatGLM-6B
Device: 4 × A800-80G

70 GBs allocated for each GPU.

The code I'm using is https://github.com/beyondguo/LLM-Tuning/blob/796384e837b3b6d70564d50ef5bb46f9175cb700/chatglm_lora_tuning.py#L87

younesbelkada commented 1 year ago

Thanks for sharing those

Model: ChatGLM-6B

I see the model is running in full precision; a 6B model needs about 24 GB of VRAM just to be loaded on the GPU.
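(Roughly: 6 × 10^9 parameters × 4 bytes per fp32 weight ≈ 24 GB, before activations, gradients, and optimizer states; fp16 would halve that and 8-bit roughly quarter it.)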

70 GBs allocated for each GPU.

Do you run your script using torch.distributed.run or just python yourscript.py?

beyondguo commented 1 year ago

Simply python yourscript.py. I'm using Trainer, which I think should automatically manage the GPU allocation.

younesbelkada commented 1 year ago

I see better now. If you want to benefit from data parallelism, as mentioned here: https://github.com/huggingface/transformers/issues/21736#issuecomment-1595699638 or in the original message from the author, you need 2 things: load the model on the device of the current process (e.g. device_map={'': Accelerator().process_index} as shown above), and launch the script with a distributed launcher (accelerate launch or torch.distributed.run) instead of plain python.

beyondguo commented 1 year ago

Thanks! I will try these later.

beyondguo commented 1 year ago

Hi @younesbelkada, sorry to bother you again. I'm still working on the device_map thing... I'm curious how transformers automatically allocates the layers to different GPUs.

When I load the ChatGLM-6B model with device_map="auto", the layers are allocated as follows:

{'transformer.word_embeddings': 0,
 'lm_head': 0,           <-----
 'transformer.layers.0': 0,
 'transformer.layers.1': 0,
 'transformer.layers.2': 0,
 'transformer.layers.3': 0,
 'transformer.layers.4': 0,
 'transformer.layers.5': 1,
 'transformer.layers.6': 1,
 'transformer.layers.7': 1,
 'transformer.layers.8': 1,
 'transformer.layers.9': 1,
 'transformer.layers.10': 1,
 'transformer.layers.11': 1,
 'transformer.layers.12': 1,
 'transformer.layers.13': 1,
 'transformer.layers.14': 2,
 'transformer.layers.15': 2,
 'transformer.layers.16': 2,
 'transformer.layers.17': 2,
 'transformer.layers.18': 2,
 'transformer.layers.19': 2,
 'transformer.layers.20': 2,
 'transformer.layers.21': 2,
 'transformer.layers.22': 2,
...
 'transformer.layers.24': 3,
 'transformer.layers.25': 3,
 'transformer.layers.26': 3,
 'transformer.layers.27': 3,
 'transformer.final_layernorm': 3}

And when I change the model to ChatGLM2-6B, the allocation is:

{'transformer.embedding': 0,
 'transformer.rotary_pos_emb': 0,
 'transformer.encoder.layers.0': 0,
 'transformer.encoder.layers.1': 0,
 'transformer.encoder.layers.2': 0,
 'transformer.encoder.layers.3': 0,
 'transformer.encoder.layers.4': 0,
 'transformer.encoder.layers.5': 0,
 'transformer.encoder.layers.6': 1,
 'transformer.encoder.layers.7': 1,
 'transformer.encoder.layers.8': 1,
 'transformer.encoder.layers.9': 1,
 'transformer.encoder.layers.10': 1,
 'transformer.encoder.layers.11': 1,
 'transformer.encoder.layers.12': 1,
 'transformer.encoder.layers.13': 1,
 'transformer.encoder.layers.14': 2,
 'transformer.encoder.layers.15': 2,
 'transformer.encoder.layers.16': 2,
 'transformer.encoder.layers.17': 2,
 'transformer.encoder.layers.18': 2,
 'transformer.encoder.layers.19': 2,
 'transformer.encoder.layers.20': 2,
 'transformer.encoder.layers.21': 2,
 'transformer.encoder.layers.22': 3,
...
 'transformer.encoder.layers.25': 3,
 'transformer.encoder.layers.26': 3,
 'transformer.encoder.layers.27': 3,
 'transformer.encoder.final_layernorm': 3,
 'transformer.output_layer': 3}       <-----

My question is: the lm_head layer in ChatGLM-6B and the output_layer in ChatGLM2-6B are both the last layer of their respective models, so why is lm_head placed on cuda:0 (the same device as the input layer) while output_layer is placed on cuda:3 (a different device from the input layer)?

Because of this, training ChatGLM-6B works fine, but training ChatGLM2-6B raises an error during the forward pass when computing the loss: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument target in method wrapper_CUDA_nll_loss_forward)

Do you know what the problem is? How can I fix it? Many thanks!


Update:

I have a workaround (which I think is too ugly, lol):

# reuse the device map of the already-loaded model, but pin the output layer
# to the same device as the embedding, then reload with the patched map
model.hf_device_map['transformer.output_layer'] = model.hf_device_map['transformer.embedding']
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True, device_map=model.hf_device_map)

i.e. manually move the output_layer onto the same device as the embedding and reload the model with the patched device map.

younesbelkada commented 1 year ago

Hi @beyondguo, thanks for the ping, and no problem at all. device_map='auto' dispatches the model evenly across all available GPUs. I think the issue you are facing is that, for the first model, the lm_head weight is probably tied to the embedding layer (i.e. they are the same tensor), hence that layer sits on the first GPU; for the second model, the output layer is apparently not tied to the embedding. Regarding your solution, I think it looks fine; you could make it slightly more efficient by building the first (throwaway) model on the meta device using the init_empty_weights() context manager from accelerate. Thanks!
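
For reference, a rough sketch of that idea (the module names are taken from the device map printed above; infer_auto_device_map may group modules differently depending on available memory, so adjust the keys as needed):

from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModel

# build the model skeleton on the meta device, so no weights are materialized
config = AutoConfig.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
with init_empty_weights():
    empty_model = AutoModel.from_config(config, trust_remote_code=True)

# let accelerate propose a layer-to-GPU assignment, then pin the output layer
# to the same device as the embedding so the loss is computed on one device
device_map = infer_auto_device_map(empty_model)
device_map["transformer.output_layer"] = device_map["transformer.embedding"]

model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", trust_remote_code=True, device_map=device_map
)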

simeneide commented 1 year ago

Hey, I've tried "everything" now, but can't get 8-bit LoRA multi-GPU training to work. I have a minimal example here:

https://gist.github.com/simeneide/80aa37108474aa32b82cb7258778287b

I also tried the device_map={'':torch.cuda.current_device()} trick above without success. Not really sure what you are doing differently, @beyondguo?

Anyone? I'm getting desperate 😂

transformers==4.31
bitsandbytes==0.41.1
accelerate==0.21.0
torch==2.0.1

younesbelkada commented 1 year ago

Hi @simeneide

Thanks for the ping. Can you try out the solution proposed in this comment: https://github.com/huggingface/accelerate/issues/1840#issuecomment-1683105994?

simeneide commented 1 year ago

I hope the ping wasn't during sleeping hours 😬

Yes, that worked. Thank you very much!

younesbelkada commented 1 year ago

Hahah, no worries, it wasn't! Great that the solution worked! :D