Difference in embedding weight initialization for randomly initialized T5 model

System Info

transformers

Who can help?

@ArthurZucker

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

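The problem is in the library's initialization code rather than in a particular script, so I will describe it here. As I understand it, the intent is for weight initialization to match between PyTorch and TF T5 models that are initialized from scratch. For the embedding weights, it currently does not.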

In https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L821 the embedding weights are initialized with a standard deviation of factor * 1.0, i.e. a variance of 1 when factor is 1. In Mesh TensorFlow, however, the embedding weights fall back to the default tf.random_normal_initializer(), which uses a standard deviation of 0.05: https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1635

https://www.tensorflow.org/api_docs/python/tf/random_normal_initializer

According to the docs, its default arguments are:

tf.random_normal_initializer(
    mean=0.0, stddev=0.05, seed=None
)

PyTorch initialization:

            # Mesh TensorFlow embeddings initialization
            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624
            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)

TF initialization:

def embedding_weights(mesh,
                      vocab_dim,
                      output_dim,
                      variable_dtype,
                      name="embedding",
                      ensemble_dim=None,
                      initializer=None):
  """Embedding weights."""
  shape = mtf.Shape(
      [ensemble_dim] if ensemble_dim else []) + [vocab_dim, output_dim]
  if initializer is None:
    initializer = tf.random_normal_initializer()
  ret = mtf.get_variable(
      mesh, name, shape, dtype=variable_dtype, initializer=initializer)
  return ret
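To make the size of the gap concrete, here is a minimal sketch that does not touch either library. It only draws two stand-in embedding matrices with the standard library, one with the PyTorch-side std of 1.0 (assuming factor is 1.0) and one with the 0.05 quoted from the TF docs, and compares their empirical spread. The helper names and the small matrix shape are hypothetical and chosen purely for illustration.

```python
import random
import statistics


def sample_embedding(vocab_size, d_model, std, seed):
    """Draw a vocab_size x d_model matrix with entries from N(0, std**2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(d_model)]
            for _ in range(vocab_size)]


def empirical_std(matrix):
    """Population standard deviation over all entries of the matrix."""
    flat = [value for row in matrix for value in row]
    return statistics.pstdev(flat)


# Assumption: factor == 1.0 on the PyTorch side, so std = factor * 1.0 = 1.0.
PT_STD = 1.0
# Default stddev quoted above from the tf.random_normal_initializer docs.
TF_STD = 0.05

# Small illustrative shape, not the real T5 vocabulary or hidden size.
pt_weights = sample_embedding(200, 64, PT_STD, seed=0)
tf_weights = sample_embedding(200, 64, TF_STD, seed=1)

pt_std = empirical_std(pt_weights)
tf_std = empirical_std(tf_weights)
ratio = pt_std / tf_std

print(f"PyTorch-style std: {pt_std:.4f}")
print(f"TF-style std:      {tf_std:.4f}")
print(f"ratio:             {ratio:.1f}")
```

If the two defaults are as described, the PyTorch-style weights should come out roughly 20 times larger in scale than the TF-style ones, which corresponds to a 400x difference in variance.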

This was already mentioned in #16749, but since that issue covers two problems, this first one seems to have gone unnoticed, so I am opening a separate issue for it.

Expected behavior

I expect the embedding weight initialization to be the same across the TF and PyTorch T5 models.
