huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Input mismatch with TFDistilBert training from scratch despite cross-checking input dimensions #10325

Closed (DarshanDeshpande closed this issue 3 years ago)

DarshanDeshpande commented 3 years ago

Environment info

Who can help

@jplu

Information

Model I am using (Bert, XLNet ...): TFDistilBert

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

import tensorflow as tf
import tokenizers
from transformers import DistilBertConfig, TFDistilBertForMaskedLM

tokenizer = tokenizers.BertWordPieceTokenizer("/content/drive/Shareddrives/Darshan's Shared Driver/NewTrainingData/Tokenizer/vocab.txt", strip_accents=False)
tokenizer.enable_padding(length=128)
tokenizer.enable_truncation(max_length=128)

def tokenize(sentence):
  sentence = sentence.numpy().decode('utf-8')
  a = tokenizer.encode(sentence)
  return tf.constant(a.ids,tf.int32), tf.constant(a.attention_mask, tf.int32)

def get_tokenized(sentence):
  return tf.py_function(tokenize, inp=[sentence], Tout=[tf.int32,tf.int32])

with open("TextFile.txt") as f:
  lines = f.readlines()

dataset = tf.data.Dataset.from_tensor_slices(lines)
dataset = dataset.map(get_tokenized, num_parallel_calls=tf.data.AUTOTUNE)

config = DistilBertConfig(vocab_size=30000)
model = TFDistilBertForMaskedLM(config)
inp1 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
inp2 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
op = model([inp1, inp2])
model = tf.keras.models.Model(inputs=[inp1, inp2], outputs=op)

model.compile(tf.keras.optimizers.Adam(1e-4))
model.fit(dataset.batch(32).prefetch(tf.data.AUTOTUNE), epochs=1)

Error:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
        outputs = model.train_step(data)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:754 train_step
        y_pred = self(x, training=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py:998 __call__
        input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/input_spec.py:207 assert_input_compatibility
        ' input tensors. Inputs received: ' + str(inputs))

    ValueError: Layer model expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=<unknown> dtype=int32>]

I have cross-checked the output shapes and input dimensions. If this is not the correct way, how exactly do I train a TFDistilBert model from scratch?

Expected behavior

Training should start as soon as fit is called.

jplu commented 3 years ago

Hello!

First, I can see several issues with the way you are trying to train the model:

  1. The way you build your dataset is not correct. More precisely, the tokenize function returns a two-element tuple, so Keras takes the first element (a.ids) as the input and the second (a.attention_mask) as the label. Hence the error you get. (See the sketch below for the structure Keras expects.)
  2. When you instantiate your tf.keras.models.Model, you define the inputs and the outputs to be the same; this is not correct either. You have to run the model once and then pass that output as the outputs.
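
For illustration only, here is a minimal sketch of the element structure tf.keras expects from a tf.data.Dataset, namely (features, labels). The dummy tensors, shapes, and the way the dataset is built below are assumptions made for the example, not code from this thread:

import tensorflow as tf

# Minimal sketch with illustrative dummy data: Keras treats a 2-tuple dataset
# element as (features, labels), so both model inputs must live on the
# features side, e.g. in a dict keyed by the Input layer names.
features = {
    "input_ids": tf.ones((8, 128), dtype=tf.int32),       # dummy token ids
    "attention_mask": tf.ones((8, 128), dtype=tf.int32),  # dummy attention mask
}
labels = tf.ones((8, 128), dtype=tf.int32)                # dummy labels
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

for batch_features, batch_labels in dataset.take(1):
    print(batch_features["input_ids"].shape)  # (2, 128)
    print(batch_labels.shape)                 # (2, 128)
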
DarshanDeshpande commented 3 years ago

@jplu I realized my mistake and I changed the code to this

import tensorflow as tf
from transformers import DistilBertConfig, TFDistilBertForMaskedLM

def tokenize(sentence):
  sentence = sentence.numpy().decode('utf-8')
  a = tokenizer.encode(sentence)
  return tf.constant(a.ids,tf.int32), tf.constant(a.attention_mask, tf.int32)

def get_tokenized(sentence):
  return tf.py_function(tokenize, inp=[sentence], Tout=[tf.int32,tf.int32])

def get_tokenized_final(a,b):
  return (a,b), None

dataset = tf.data.Dataset.from_tensor_slices(lines)
dataset = dataset.map(get_tokenized, num_parallel_calls=tf.data.AUTOTUNE).map(get_tokenized_final, num_parallel_calls=tf.data.AUTOTUNE)

config = DistilBertConfig(vocab_size=30000)
model = TFDistilBertForMaskedLM(config)
inp1 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
inp2 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
op = model([inp1,inp2])
model = tf.keras.models.Model(inputs=[inp1, inp2], outputs=model.output)

Now the model throws two warnings

WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.

and then throws the final error

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:805 train_function  *
        return step_function(self, iterator)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:795 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:1259 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:3417 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:788 run_step  **
        outputs = model.train_step(data)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:757 train_step
        self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:498 minimize
        return self.apply_gradients(grads_and_vars, name=name)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:598 apply_gradients
        grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/utils.py:79 filter_empty_gradients
        ([v.name for _, v in grads_and_vars],))

    ValueError: No gradients provided for any variable: ['tf_distil_bert_for_masked_lm_1/distilbert/embeddings/word_embeddings/weight:0', 'tf_distil_bert_for_masked_lm_1/distilbert/embeddings/position_embeddings/embeddings:0', 'tf_distil_bert_for_masked_lm_1/distilbert/embeddings/LayerNorm/gamma:0', 'tf_distil_bert_for_masked_lm_1/distilbert/embeddings/LayerNorm/beta:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/q_lin/kernel:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/q_lin/bias:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/k_lin/kernel:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/k_lin/bias:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/v_lin/kernel:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/v_lin/bias:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/out_lin/kernel:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/attention/out_lin/bias:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/sa_layer_norm/gamma:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/sa_layer_norm/beta:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/ffn/lin1/kernel:0', 'tf_distil_bert_for_masked_lm_1/distilbert/transformer/layer_._0/ffn/lin1/bias:0', 'tf_distil_bert_for_masked_lm_1/distilbert/tra...

Any idea what I am doing wrong?

jplu commented 3 years ago

You cannot use model.output. As I said in my previous message, you have to run the model once to see what the output looks like and pass that output as the outputs (a sketch follows below) :)
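
A minimal sketch of what running the model once on the Input layers could look like, mirroring the configuration used earlier in this thread; the variable name keras_model is purely illustrative:

import tensorflow as tf
from transformers import DistilBertConfig, TFDistilBertForMaskedLM

config = DistilBertConfig(vocab_size=30000)
model = TFDistilBertForMaskedLM(config)
inp1 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
inp2 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
# Run the model once on the symbolic inputs and keep the returned output...
output = model([inp1, inp2])
# ...then build the Keras model from that output, not from model.output.
keras_model = tf.keras.models.Model(inputs=[inp1, inp2], outputs=[output])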

DarshanDeshpande commented 3 years ago

@jplu Could you tell me exactly what you mean by "run" the model? If I pass a sample array of all ones, it gives me a broadcasting error, as follows:

config = DistilBertConfig(vocab_size=30000)
model = TFDistilBertForMaskedLM(config)
inp1 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
inp2 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
_ = model([inp1,inp2])

# Error is thrown for this call
a = tf.ones((128,),dtype=tf.int32)
model((a,a))

The error is as follows:

InvalidArgumentError: Incompatible shapes: [512,768] vs. [128,768] [Op:BroadcastTo]

More specifically, the error is raised in modeling_tf_distilbert.py:

    183         if position_ids is None:
--> 184             position_embeds = self.position_embeddings(position_ids=inputs_embeds)
    185         else:
    186             position_embeds = self.position_embeddings(position_ids=position_ids)

If by "run" you mean calling fit on the model, then it raises the same gradient error.
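
Note: the broadcast error presumably comes from the dummy tensors having no batch dimension, since the Input layers above expect inputs of shape (batch_size, 128). A hypothetical sanity check, reusing the model built in the snippet above rather than code taken from this thread, would pass batched dummy tensors:

# Hypothetical check (assumed shapes, not from the thread): call the model
# with a batch dimension, matching the (None, 128) Input layers defined above.
a = tf.ones((1, 128), dtype=tf.int32)   # one dummy sequence of length 128
model({"input_ids": a, "attention_mask": a})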

jplu commented 3 years ago

Here is a dummy example:

import tensorflow as tf
from transformers import TFDistilBertForMaskedLM, DistilBertTokenizer, DistilBertConfig

config = DistilBertConfig(vocab_size=30000)
model = TFDistilBertForMaskedLM(config)
inp1 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
inp2 = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
output = model([inp1,inp2])
model = tf.keras.models.Model(inputs=[inp1,inp2], outputs=[output])
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
data = tokenizer(["Hello1", "Hello2", "Hello3"], truncation=True, max_length=128, padding="max_length", return_tensors="tf")
labels = tf.ones((3, 128), dtype=tf.int32)
X = tf.data.Dataset.from_tensor_slices((dict(data), labels)).batch(1)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer="adam")
model.fit(X, epochs=1)
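
Note that the labels here are dummy all-ones tensors whose only purpose is to give fit something to optimize; for actual masked-language-model pretraining you would mask a fraction of the input tokens and use their original ids as labels (a general remark about MLM training, not something the example above implements).
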
DarshanDeshpande commented 3 years ago

@jplu Thanks for this, but it tokenizes the data up front and then loads it as a tf.data.Dataset. I was looking for an implementation where the tokenization is integrated into the pipeline itself and done on the fly. I found this issue on TensorFlow, but there are no fixes for it yet. Do you have any idea how to do this? My dataset fits in Colab memory, but it cannot be fully tokenized in memory.

jplu commented 3 years ago

Sorry, you cannot do this.

DarshanDeshpande commented 3 years ago

Okay. Thanks for all the help!