[Deepspeed][initialization] pegasus: unable to load/init the weights

sajastu commented 3 years ago

Environment info

transformers version: 4.9.0.dev0
Platform: Ubuntu
Python version: 3.8
PyTorch version (GPU?): Y
Using GPU in script?: Y
Using distributed or parallel set-up in script?: Y - Deepspeed version: deepspeed 0.4.1 (installed with pip)

@stas00,

Information

I'm trying to fine-tuned pegasus-large model using deepspeed with multi-gpu. It seems that deepspeed is unable to initialize the weights in the beginning. While, I removed deepspeed and weights seem to be properly initialized. I'm hesitating if this is a bug with deepspeed library. Details are given below.

The command:

deepspeed --num_gpus=8 examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path google/pegasus-large \
    --do_train \
    --do_eval \
    --do_predict \
    --output_dir /home/code-base/user_space/saved_models/pegasus/reddit-xsum-1024-tuned/ \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4  \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --adam_beta2 0.98 \
    --num_train_epochs 10 \
    --overwrite_output_dir \
    --predict_with_generate \
    --evaluation_strategy steps  --eval_steps 1000 --save_steps 1000 --warmup_steps 10000 \
    --text_column document \
    --summary_column summary \
    --train_file $DS_BASE_DIR_P/train.json \
    --validation_file $DS_BASE_DIR_P/validation.json \
    --test_file $DS_BASE_DIR_P/test.json \
    --deepspeed ds_config.json

Error message:

...
Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 617, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 355, in main
    model = AutoModelForSeq2SeqLM.from_pretrained(
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/auto/auto_factory.py", line 395, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/modeling_utils.py", line 1176, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 1209, in __init__
    self.model = PegasusModel(config)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 1082, in __init__
    self.encoder = PegasusEncoder(config, self.shared)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 652, in __init__
    self.embed_positions = PegasusSinusoidalPositionalEmbedding(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 114, in __init__
    self.weight = self._init_weight(self.weight)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 122, in _init_weight
    n_pos, dim = out.shape
ValueError: not enough values to unpack (expected 2, got 1)
Killing subprocess 3351
Killing subprocess 3352
Killing subprocess 3353
Killing subprocess 3354
Killing subprocess 3355
Killing subprocess 3356
Killing subprocess 3357
Killing subprocess 3358
...

ds_config.json is Zero3 copied from the repository.
I checked self.out: with deepspeed its shape is [1] and only contains a 1-d tensor with value 1. However, in single-gpu env, the shape is [1024, 1024] which contains floating numbers (i.e., much like embeddings).

The problem arises when using:

[ x] the official example scripts: (give details below)
[ ] my own modified scripts: (give details below)

The tasks I am working on is:

[ ] an official GLUE/SQUaD task: (give the name)
[ x] my own task or dataset: (give details below) --reddit_tifu_long

To reproduce

Steps to reproduce the behavior:

Running the above command with deepspeed.

stas00 commented 3 years ago

Thank you for the report, @sajastu

Could you please adjust the command line in your report so that it uses some small public dataset and not custom files which we don't have?

Then I will sort it out.

Thank you.

sajastu commented 3 years ago

Sure thing! @stas00

Please let me modify the script, and then test so that it runs flawlessly. I'll give you an update shortly!

stas00 commented 3 years ago

I was able to reproduce the problem with:

export BS=16; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 deepspeed --num_gpus=2  \
examples/pytorch/summarization/run_summarization.py --model_name_or_path  \
google/pegasus-cnn_dailymail --output_dir output_dir --adam_eps 1e-06 --do_train --label_smoothing  \
0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 500 --max_source_length 128  \
--max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size  \
$BS --predict_with_generate --sortish_sampler --dataset_name cnn_dailymail --dataset_config "3.0.0"  \
--val_max_target_length 128 --warmup_steps 50 --max_train_samples 50 --max_eval_samples 50  \
--deepspeed tests/deepspeed/ds_config_zero3.json

So nothing else needs to be done by your side.

stas00 commented 3 years ago

so the quick fix is:

--- a/src/transformers/models/pegasus/modeling_pegasus.py
+++ b/src/transformers/models/pegasus/modeling_pegasus.py
@@ -26,6 +26,7 @@ from torch import nn
 from torch.nn import CrossEntropyLoss

 from ...activations import ACT2FN
+from ...deepspeed import is_deepspeed_zero3_enabled
 from ...file_utils import (
     add_end_docstrings,
     add_start_docstrings,
@@ -109,7 +110,13 @@ class PegasusSinusoidalPositionalEmbedding(nn.Embedding):

     def __init__(self, num_positions: int, embedding_dim: int, padding_idx: Optional[int] = None):
         super().__init__(num_positions, embedding_dim)
-        self.weight = self._init_weight(self.weight)
+        if is_deepspeed_zero3_enabled():
+            import deepspeed
+            with deepspeed.zero.GatheredParameters(self.weight, modifier_rank=0):
+                self.weight = self._init_weight(self.weight)
+        else:
+            self.weight = self._init_weight(self.weight)
+

     @staticmethod
     def _init_weight(out: nn.Parameter):

Let me know if you can handle the diff.

I will work on a normal PR and test. Ideally should think of something that requires less code changes, but it will do the right thing for now.

sajastu commented 3 years ago

@stas00 Thanks. It works perfectly now!

stas00 commented 3 years ago

thank you for validating that it works for you.

I'm trying to have this solved on the deepspeed side, so that all our models will work w/o needing to change each one of them separately. so I will keep you posted on the progress.

stas00 commented 3 years ago

If you want to try the fix on the deepspeed side, instead of the workaround on transformers side, you can try this branch: https://github.com/microsoft/DeepSpeed/pull/1202

stas00 commented 3 years ago

https://github.com/microsoft/DeepSpeed/pull/1202 has been merged, so if you use the master version of deepspeed, you no longer need the workaround I shared with you.

I will close this, but if you still encounter any problems please feel free to re-open.

huggingface / transformers