microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

[BUG] ZeRO-3 can't handle a new Parameter in forward #1757

Open stas00 opened 2 years ago

stas00 commented 2 years ago

Describe the bug

This model defines a new nn.Parameter in forward:

https://github.com/huggingface/transformers/blob/e923917cd975d6768d90eb49fdab6468b33b214f/src/transformers/models/m2m_100/modeling_m2m_100.py#L133-L134
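
For readers who don't want to follow the link, the pattern is roughly the following (a condensed sketch of the make_weights logic with simplified names, not the actual transformers code):

    import torch
    import torch.nn as nn

    class SinusoidalPositionalEmbedding(nn.Module):
        """Condensed sketch: the positional embedding parameter is
        (re)built inside forward when a longer sequence shows up."""

        def __init__(self, num_positions, embedding_dim):
            super().__init__()
            self.embedding_dim = embedding_dim
            self.make_weights(num_positions, embedding_dim)

        def make_weights(self, num_embeddings, embedding_dim):
            # a brand-new nn.Parameter replaces the previous self.weights;
            # under ZeRO-3 such a fresh Parameter carries no ds_* attributes
            self.weights = nn.Parameter(torch.randn(num_embeddings, embedding_dim))
            self.weights.requires_grad = False

        def forward(self, input_ids):
            seq_len = input_ids.shape[1]
            if seq_len > self.weights.shape[0]:
                # run-time extension: rebuild the parameter at the longer length
                self.make_weights(seq_len, self.embedding_dim)
            return self.weights[:seq_len]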

and of course deepspeed is not equipped for that scenario and fails:

E               if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
E           AttributeError: 'Parameter' object has no attribute 'ds_status'
E               embed_pos = self.embed_positions(input_ids, inputs_embeds)
E             File "/home/stas/anaconda3/envs/py38-pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
E               result = forward_call(*input, **kwargs)
E             File "/home/stas/anaconda3/envs/py38-pt110/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
E               return func(*args, **kwargs)
E             File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/models/m2m_100/modeling_m2m_100.py", line 175, in forward
E               self.make_weights(max_pos + self.offset, self.embedding_dim, self.padding_idx)
E             File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/src/transformers/models/m2m_100/modeling_m2m_100.py", line 134, in make_weights
E               self.weights.requires_grad = False
E             File "/home/stas/anaconda3/envs/py38-pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1168, in __getattr__
E               return _parameters[name]
E             File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage3.py", line 150, in __getitem__
E               if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
E           AttributeError: 'Parameter' object has no attribute 'ds_status'

This param is not known to DeepSpeed, yet DeepSpeed still tries to fetch it via ZeROOrderedDict.__getitem__, because the sub-module class was created under zero.Init.

It's important to note that it doesn't create a totally unrelated nn.Parameter: it rebuilds, at a larger size, the parameter it already created at __init__, but assigns the result as a new object to the same attribute.
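
Put differently, even a "resize" produces a brand-new Parameter object that ZeRO-3's bookkeeping has never seen. A minimal plain-PyTorch illustration of the object swap (no DeepSpeed required):

    import torch
    import torch.nn as nn

    m = nn.Module()
    m.weights = nn.Parameter(torch.zeros(10, 4))   # registered at __init__ time
    old_id = id(m.weights)

    # "resizing" at forward time really builds a new Parameter object;
    # only the old object was wrapped by zero.Init with ds_status & co.
    m.weights = nn.Parameter(torch.zeros(20, 4))
    assert id(m.weights) != old_id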

The same code works everywhere else but ZeRO-3.

All of our positional embedding classes follow the same pattern: a param is created at init, which is all good, but if at run time a longer positional embedding is required, a new param of that longer length is created, replacing the original one.

We do the same with normal embeddings, but those get resized, if needed, before the first forward.

I guess until now the tests didn't happen to extend the length of the embeddings, but the test I have just added to the model zoo deepspeed tests did trigger this scenario.

@jeffra started working on a solution here: https://github.com/microsoft/DeepSpeed/pull/1606 but it appears to have fallen through the cracks.

So I thought I'd document the original issue properly.

To reproduce

git clone https://github.com/huggingface/transformers
cd transformers
# this should work
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 pytest tests/deepspeed/test_model_zoo.py -k test_zero_to_fp32_zero3_trans_m2m_100

# now swap in a different model and the test fails
perl -pi -e 's|stas/tiny-m2m_100|hf-internal-testing/tiny-random-m2m_100|' tests/deepspeed/test_model_zoo.py
CUDA_VISIBLE_DEVICES=0 RUN_SLOW=1 pytest tests/deepspeed/test_model_zoo.py -k test_zero_to_fp32_zero3_trans_m2m_100

To bypass the test and go straight to the script that fails:

deepspeed --num_nodes 1 --num_gpus 1 --master_port 10999 \
examples/pytorch/translation/run_translation.py --train_file \
tests/fixtures/tests_samples/wmt_en_ro/train.json --source_lang en \
--target_lang ro --model_name_or_path hf-internal-testing/tiny-random-m2m_100 \
--do_train --max_train_samples 4 --per_device_train_batch_size 2 \
--num_train_epochs 1 --fp16 --report_to none --overwrite_output_dir \
--deepspeed tests/deepspeed/ds_config_zero3.json --output_dir /tmp/tmp2i4fmejh \
--save_steps 1

The key difference between the model that works and the one that doesn't: the first has a longish positional embedding tensor, so it doesn't get re-created during the first forward and all is good; the failing model has a very short positional embedding, which then does get re-created at forward time.

stas00 commented 2 years ago

Also related to this issue is what happens when 2 params are replaced as in this PR: https://github.com/huggingface/transformers/pull/16093

This code: https://github.com/huggingface/transformers/blob/d29a1cc18cb960909633aa0e56dfee4b0ffbd326/src/transformers/modeling_utils.py#L880-L910
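
The gist of that code path, as a rough sketch rather than the actual modeling_utils implementation (the hypothetical resize_embedding helper below is just for illustration): a new, larger nn.Embedding is allocated and the overlapping rows are copied over, so the module ends up holding a weight that the ZeRO-3 engine never registered.

    import torch.nn as nn

    def resize_embedding(old: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
        # allocate a brand-new embedding and copy the overlapping rows over;
        # under ZeRO-3 the old weight may be a 0-sized placeholder unless it
        # is gathered first, and the new weight has no ds_* bookkeeping
        new = nn.Embedding(new_num_tokens, old.embedding_dim)
        n = min(old.num_embeddings, new_num_tokens)
        new.weight.data[:n, :] = old.weight.data[:n, :]
        return new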

stas00 commented 2 years ago

Here is another related report: https://github.com/huggingface/transformers/issues/16688 but the failure is different here:

(unformatted so that it can wrap)

RuntimeError: tracing error at step 42: expected the next 2 parameters in the parameter fetch queue to be ({'id': 26, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}, {'id': 27, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}) but got ({'id': 115, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1024, 'shape': (0,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()}, {'id': 116, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()}).

the full traceback is in the Issue I linked to.

stas00 commented 2 years ago

At the moment the workaround is to ensure that when you instantiate the model, its config.max_position_embeddings is set to the longest seqlen, so that it doesn't need to remake the positional embeddings during forward; that way no new Parameter is created once training has started and everything works.

To accomplish that you can do:

    from transformers import AutoConfig, AutoModelForSeq2SeqLM

    config = AutoConfig.from_pretrained(mname, ...)
    config.max_position_embeddings = 2048  # adjust to the longest seqlen of the inputs
    model = AutoModelForSeq2SeqLM.from_pretrained(mname, config=config, ...)

This impacts quite a few other models, e.g. FSMT and others that extend their positional embeddings at forward time.