Also related to this issue is what happens when 2 params are replaced as in this PR: https://github.com/huggingface/transformers/pull/16093
Here is another related report: https://github.com/huggingface/transformers/issues/16688, but the failure there is different:
(unformatted so that it can wrap)
RuntimeError: tracing error at step 42: expected the next 2 parameters in the parameter fetch queue to be ({'id': 26, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}, {'id': 27, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}) but got ({'id': 115, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1024, 'shape': (0,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()}, {'id': 116, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()}).
The full traceback is in the issue I linked to.
At the moment the workaround is to ensure that when you instantiate the model, its `config.max_position_embeddings` is set to the longest seqlen of the inputs, so that it doesn't need to remake the positional embeddings during `forward` and thus won't create a new `nn.Parameter` once training has started, and everything will work.
To accomplish that you can do:

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained(mname, ...)
config.max_position_embeddings = 2048  # adjust to the longest seqlen of the inputs
model = AutoModelForSeq2SeqLM.from_pretrained(mname, config=config, ...)
```
This impacts quite a few other models, e.g. FSMT and others that extend their positional embeddings at `forward` time.
Describe the bug
This model defines a new `nn.Parameter` in `forward`:
https://github.com/huggingface/transformers/blob/e923917cd975d6768d90eb49fdab6468b33b214f/src/transformers/models/m2m_100/modeling_m2m_100.py#L133-L134

Of course DeepSpeed is not equipped for that scenario and fails: this param is not known by DS, yet for some reason it tries to get it via `ZeROOrderedDict.__getitem__`, because the sub-module class was created with `zero.Init`.

It's important to note that it doesn't create a totally new `nn.Parameter`; it resizes the one it created at `__init__`, but assigns it as a new variable. The same code works everywhere else but ZeRO-3.
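To make the pattern concrete, here is a minimal, self-contained sketch (not the actual transformers code, which is at the link above; the class name is made up) of a positional embedding that grows its weight inside `forward` by assigning a new `nn.Parameter` to the same attribute:

```python
import torch
import torch.nn as nn


class GrowablePositionalEmbedding(nn.Module):
    """Illustrative only: mimics a positional embedding that is re-created at
    forward time when the incoming sequence is longer than the table built at
    __init__."""

    def __init__(self, num_positions: int, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim
        # Created at __init__ (under zero.Init in the ZeRO-3 case), so this is
        # the parameter DeepSpeed knows about and partitions.
        self.weights = nn.Parameter(torch.randn(num_positions, embedding_dim))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        seq_len = input_ids.shape[1]
        if seq_len > self.weights.shape[0]:
            # The attribute is re-assigned with a brand new nn.Parameter object.
            # Plain PyTorch is fine with this, but ZeRO-3's bookkeeping
            # (ZeROOrderedDict, the parameter fetch queue) never registered it,
            # hence the tracing error quoted above.
            self.weights = nn.Parameter(
                torch.randn(seq_len, self.embedding_dim, device=self.weights.device)
            )
        return self.weights[:seq_len]
```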
We have the same pattern in all positional embedding classes: a param is created at init, which is all good, but if at run time a longer positional embedding is required, a new param of that longer length is created, replacing the original one. We do the same with normal embeddings, but those get resized if needed before the first `forward` (see the sketch below). I guess until now the tests didn't happen to try to extend the length of the embeddings, but the one I have added to the model zoo deepspeed tests did trigger this scenario.
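For contrast, here is a minimal sketch of the "normal embeddings" path, where the replacement happens before the first `forward` via the public `resize_token_embeddings` API (the checkpoint name is just an example):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mname = "facebook/m2m100_418M"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Grow the vocab and the matching embedding matrix *before* the first forward,
# so the parameter replacement happens before ZeRO-3 starts tracing steps.
tokenizer.add_tokens(["<new_token>"])
model.resize_token_embeddings(len(tokenizer))
```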
@jeffra started working on a solution here: https://github.com/microsoft/DeepSpeed/pull/1606 but it appears to have fallen through the cracks.
So I thought I'd document the original issue properly.
To reproduce
To bypass the test and go straight to the script that fails:
The key difference between the model that works and the one that doesn't is that the former has a long enough positional embedding tensor, so it doesn't get re-created during the first `forward` and all is good, whereas the failing model has a very short positional embedding which then gets re-created at `forward` time.
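This is not the original repro script (that is in the linked issue), just a sketch of the condition described above; the checkpoints and the sequence length are placeholders:

```python
from transformers import AutoConfig

# Substitute the actual checkpoints and the longest seqlen of your inputs.
# If inputs are longer than max_position_embeddings, the positional embedding
# parameter gets re-created inside forward and ZeRO-3 fails as shown above.
longest_seqlen = 1500

for mname in ("facebook/m2m100_418M", "facebook/wmt19-en-de"):
    config = AutoConfig.from_pretrained(mname)
    will_regrow = longest_seqlen > config.max_position_embeddings
    print(
        f"{mname}: max_position_embeddings={config.max_position_embeddings}, "
        f"re-creates positional embeddings in forward: {will_regrow}"
    )
```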