huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

"RuntimeError: 'weight' must be 2-D" training with DeepSpeed #24643

Closed ZizoAdam closed 1 year ago

ZizoAdam commented 1 year ago

System Info

Who can help?

@pacman100 @sgugger

Information

Tasks

Reproduction

The dataset is my own: just a few hundred strings in a CSV file produced by pandas.

Running the following code

from transformers import GPTJForCausalLM, AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling
import os
from torch.utils.data import Dataset
import pandas as pd
import evaluate
import numpy as np
import sklearn
import torch
from transformers.trainer_pt_utils import get_parameter_names

model_name = "EleutherAI/gpt-j-6b"

d_type = "auto"

print("CUDA Available: "+ str(nn.cuda.is_available()))
print("CUDA Version: " + str(nn.version.cuda))
print("GPUs Available: "+ str(nn.cuda.device_count()))

def process_csv(filename, tknizer):
    data = pd.read_csv(filename)
    return tknizer(list(data["text"].values.flatten()), padding=True, truncation=True, return_tensors="pt")

tokenizer = AutoTokenizer.from_pretrained(model_name, torch_dtype=d_type)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
tokenizer.pad_token = tokenizer.eos_token

class MyDataset(Dataset):
    def __init__(self, tokenized_input):
        self.tokenized_input = tokenized_input

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.tokenized_input.items()}

    def __len__(self):
        return len(self.tokenized_input.input_ids)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

train_data = MyDataset(process_csv("train_data.csv", tokenizer))

eval_data = MyDataset(process_csv("test_data.csv", tokenizer))

training_args = TrainingArguments(
    output_dir="test_trainer",
    deepspeed="deepSpeedCPU.json",
)

model = GPTJForCausalLM.from_pretrained(model_name, torch_dtype=d_type).cuda()

print("Total Memory: " + str(nn.cuda.get_device_properties(0).total_memory))
print("Reserved: " + str(nn.cuda.memory_reserved(0)))
print("Allocated: " + str(nn.cuda.memory_allocated(0)))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()

using the following config file

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Causes an error at trainer.train()

Traceback (most recent call last):
  File "/home/augustus/ADAM/main2.py", line 82, in <module>
    trainer.train()
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 854, in forward
    transformer_outputs = self.transformer(
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 634, in forward
    inputs_embeds = self.wte(input_ids)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D

Expected behavior

I would expect training to begin, or at least a more verbose error that would help me fix the issue (if it is something I can fix on my side).

ydshieh commented 1 year ago

Hi

While waiting for @pacman100's comment, you could check the shape of self.wte. It would also be a good idea to double-check whether the issue happens without DeepSpeed.

  File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 634, in forward
    inputs_embeds = self.wte(input_ids)
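
For reference, a minimal way to inspect this in the reproduction script (a sketch; model is the GPTJForCausalLM instance created above, and the DeepSpeed lines are only relevant once ZeRO-3 has partitioned the parameters):

# Shape of the token-embedding weight that ends up in F.embedding.
# Without DeepSpeed this should be 2-D, i.e. torch.Size([vocab_size, n_embd]).
print(model.transformer.wte.weight.shape)

# Under ZeRO-3 the parameter is partitioned across ranks and shows up as an
# empty tensor unless it is gathered first (assumes a deepspeed-initialized model):
# import deepspeed
# with deepspeed.zero.GatheredParameters(model.transformer.wte.weight):
#     print(model.transformer.wte.weight.shape)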
ZizoAdam commented 1 year ago

The issue does not happen without DeepSpeed; however, we are unable to train without it because we do not have much in the way of system resources.

pacman100 commented 1 year ago

Which DeepSpeed version are you using, and how are you launching the script?

ZizoAdam commented 1 year ago

Deepspeed 0.9.5, just launching it with python3 script.py

pacman100 commented 1 year ago

Thought so. Please use a distributed launcher such as torchrun, deepspeed or accelerate when using DeepSpeed/DDP/FSDP, or any time you are doing distributed training (see the example after the links below).

Please refer:

  1. https://huggingface.co/docs/transformers/main_classes/deepspeed#deployment-with-multiple-gpus
  2. https://huggingface.co/docs/transformers/main/en/main_classes/trainer#using-accelerate-launcher-with-trainer
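
For illustration, the reproduction script above could be launched like this instead of with plain python3 (a sketch; adjust the GPU count and script name to your setup):

# DeepSpeed launcher
deepspeed --num_gpus=1 script.py

# or the PyTorch distributed launcher
torchrun --nproc_per_node=1 script.py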
pacman100 commented 1 year ago

That should resolve the issue.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yuxyang88 commented 1 year ago

I also have the same problem, also DeepSpeed stage 3 with the Trainer. @ZizoAdam did you solve the problem?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 1 year ago

@yuxyang88 if the solution did not work for you, feel free to open a new issue with a reproducer (as small as possible), making sure you are using the latest version of transformers.

nomadlx commented 1 year ago

So, please use a distributed launcher such as torchrun, deepspeed or accelerate when using DeepSpeed/DDP/FSDP, or any time you are doing distributed training.

Please refer:

  1. https://huggingface.co/docs/transformers/main_classes/deepspeed#deployment-with-multiple-gpus
  2. https://huggingface.co/docs/transformers/main/en/main_classes/trainer#using-accelerate-launcher-with-trainer

My program reported the same error (RuntimeError: 'weight' must be 2-D), but I did launch the distributed training with deepspeed. I do not understand your answer; why do you think it would solve the problem?

ydshieh commented 1 year ago

Hi @nomadlx

Please open a new issue with a reproducer (as small as possible but complete).

Also make sure you are using the latest versions of transformers / accelerate.

Thanks.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Hagtaril commented 10 months ago

I ran into the same problem as well; it was solved after I switched the transformers version from 4.35.0 to 4.31.0.

ChenYang-ChenYang commented 10 months ago

I got the same issue with deepspeed even after downgrading transformers from 4.35.0 to 4.31.0 as Hagtaril commented. Has anyone resolved the issue? My deepspeed version is 0.10.0. It worked well without deepspeed.

xipq commented 10 months ago

I got the same issue and worked it out after a day. I hit it when training DPO and PPO with the Hugging Face TRL library. The root cause of these errors is incorrect DeepSpeed initialization for your model. To solve this issue, you can double-check the following:

1) Make sure you are calling deepspeed correctly (e.g. deepspeed --num_gpus <> --master_port=<> xxx.py) when launching the training job. This should cover most cases if you are just training a single model.

2) For trickier scenarios (training DPO or PPO), please make sure ALL models are correctly initialized with DeepSpeed. Hugging Face's TRL library has some bugs in initializing DeepSpeed for the reference model, reward model, etc., so it is safest to initialize each model with from_pretrained before passing it to the Hugging Face trainer classes. In contrast, initializing reference models through TRL or copy.deepcopy() yields incorrect DeepSpeed initialization. You may see errors like these:

  • Tensors must be 2-D
  • AssertionError: {'id': 291, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {456}, 'ds_tensor.shape': torch.Size([0])} : {'id': 291, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_m

3) These errors cannot be solved by downgrading to 4.31.0. Also, I personally do not think downgrading is a good solution, since we will depend on new architectures and features (e.g. MistralForCausalLM) in future versions.

vancoykendall commented 9 months ago

I got the "weight" must be 2-D" issue using zero 3 with the TRL library to do DPO. I was also using the PEFT library to add two LoRA adapters to the model (one for the reference and one for the trained model).

Solution: I removed the embedding layer as a target module in the LoRA configs and it worked. I'm not sure why, but since the stack trace had

File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

I just tried removing it
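
For illustration, a LoRA config along those lines might look like this (a sketch with hypothetical hyperparameters; the point is that target_modules names only GPT-J's attention projections and not the embedding layer wte):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

# Target only the attention projections; leaving the embedding layer ("wte") out of
# target_modules avoids wrapping the weight that later fails in F.embedding.
lora_config = LoraConfig(
    r=16,               # hypothetical rank
    lora_alpha=32,      # hypothetical scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()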

cxjtju commented 7 months ago

How do you fix the "Tensors must be 2-D" error?

xipq commented 7 months ago

Initialize each model (reference and policy) with from_pretrained before passing them to the Hugging Face trainer classes; a minimal sketch is below.
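
For illustration only, that could look roughly like this for DPO (a sketch, not xipq's actual code; it assumes trl's DPOTrainer, whose exact keyword arguments vary between trl versions, and uses a toy preference dataset):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load BOTH the policy and the reference model with from_pretrained, instead of
# letting TRL derive the reference model from the policy (e.g. via copy.deepcopy),
# so that DeepSpeed ZeRO-3 initialization is applied to each model.
policy_model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy preference dataset with the prompt/chosen/rejected columns DPO expects.
train_dataset = Dataset.from_dict({
    "prompt": ["Hello"],
    "chosen": [" Hi there!"],
    "rejected": [" Go away."],
})

training_args = TrainingArguments(
    output_dir="dpo_out",
    deepspeed="deepSpeedCPU.json",  # same style of ZeRO-3 config as in this issue
)

trainer = DPOTrainer(
    model=policy_model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer trl versions take processing_class instead
)

# Launch with a distributed launcher, e.g.: deepspeed --num_gpus=1 dpo_script.py
trainer.train()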