Open shenh10 opened 4 months ago
@qihqi since you are off-call this week, do you have time to follow up on this issue?
This is caused by this line: https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/models/bart/modeling_bart.py#L1530C1-L1533C79
That line merges the embedding parameters for both the encoder and the decoder, resulting in 2 fewer parameters. So there are 2 parameters that share the same tensor but have different names.
You can print the state_dict length with len(model.state_dict()), and that is the same 261 for both models. The semantics of named_parameters() is unique parameters (with their names), so 2 parameters that share the same reference are only printed once.
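To illustrate the difference outside of BART, here is a minimal sketch with a toy module (not the real model) where one Parameter is registered under two names:

```python
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Embedding(10, 4)
        self.b = nn.Embedding(10, 4)
        self.b.weight = self.a.weight  # tie: one Parameter registered under two names

m = Toy()
print(len(m.state_dict()))               # 2 -- both names appear in the state_dict
print(len(list(m.named_parameters())))   # 1 -- the shared Parameter is listed once
```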
When moving to the device, tensors are moved using the state_dict, so the _tie_weights effect is lost. You can get it back by running the _tie_weights logic on new_model again.
```python
In [28]: def _tie_weights(self):
    ...:     if self.config.tie_word_embeddings:
    ...:         self._tie_or_clone_weights(self.encoder.embed_tokens, self.shared)
    ...:         self._tie_or_clone_weights(self.decoder.embed_tokens, self.shared)
    ...:

In [29]: _tie_weights(newm)

In [30]: len(list(newm.state_dict()))
Out[30]: 261

In [31]: len(list(newm.named_parameters()))
Out[31]: 259
```
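For completeness, the same fix could presumably go through the public tie_weights() entry point on PreTrainedModel instead of re-implementing _tie_weights by hand; a small sketch, assuming newm is the copy already on the XLA device:

```python
# tie_weights() re-applies the word-embedding tying when config.tie_word_embeddings
# is set, so named_parameters() should report the tied count again afterwards.
newm.tie_weights()
print(len(list(newm.named_parameters())))  # expected: 259
```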
Thank you for your response. I'm wondering why model.to() would trigger the _tie_weights function?
model.to()'s logic can be interpreted roughly as:

```python
new_state_dict = {}
for k, v in model.state_dict().items():
    new_state_dict[k] = v.to(device)
model.load_state_dict(new_state_dict)
```

So it undoes what _tie_weights does.
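As a sanity check, here is a small sketch that inspects whether the aliasing survived the move; the attribute paths assume a BartForConditionalGeneration (for a bare BartModel, drop the leading .model):

```python
def embeddings_tied(m):
    # True only if both embed_tokens weights still alias the shared Parameter.
    shared = m.model.shared.weight
    return (m.model.encoder.embed_tokens.weight is shared
            and m.model.decoder.embed_tokens.weight is shared)

print(embeddings_tied(model))      # True before the move
print(embeddings_tied(new_model))  # False after the state_dict-style move
```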
🐛 Bug
Copying the model to an XLA device changes the number of the model's parameters.
To Reproduce
Steps to reproduce the behavior:
xla/benchmarks/benchmark_model.py
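A hypothetical minimal repro along the lines discussed in this thread (the actual report goes through the benchmark script above; the checkpoint name here is only an example):

```python
import torch_xla.core.xla_model as xm
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
print(len(list(model.named_parameters())))      # tied count (259 in this report)

new_model = model.to(xm.xla_device())
print(len(list(new_model.named_parameters())))  # untied count (261 in this report)
```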
Expected behavior
len([param for param, value in new_model.named_parameters()]) is expected to return 259.

Environment