bpraveenk commented 3 years ago

Here is a description of a series of errors I encountered while fine-tuning the pre-trained gpt2 model using run_glue.py (which were also reported here). I also list the code changes I had to make to fix these errors. If the custodians of the code base are happy with the changes, I will be glad to check them in, provided the instructions for submitting a patch, getting it reviewed, and checking it in are shared with me.

Environment info

transformers version: 4.10.0.dev0
Platform: Linux-5.4.0-1051-azure-x86_64-with-glibc2.10
Python version: 3.8.1
PyTorch version (GPU?): 1.9.0
Tensorflow version (GPU?): 2.3.0
Using GPU in script?: Yes (1 gpu)
Using distributed or parallel set-up in script?:

Who can help

@patrickvonplaten, @sgugger, @patil-suraj

Model I am using (Bert, XLNet ...): GPT2

The problem arises when using:

[ ] the official example scripts: (give details below) examples/tensorflow/text-classification/run_glue.py

The tasks I am working on is:

[ ] an official GLUE/SQUaD task: GLUE

To reproduce

Steps to reproduce the behavior: (applicable to any GLUE classification task)

python run_glue.py --model_name_or_path gpt2 --task_name sst2 --do_train --do_eval --do_predict --output_dir ./output

Error 1

    File "run_glue.py", line 567, in <module>
      main()
    File "run_glue.py", line 415, in main
      optimizer = tf.keras.optimizers.Adam(
    File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/optimizer_v2/adam.py", line 115, in __init__
      super(Adam, self).__init__(name, **kwargs)
    File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 335, in __init__
      raise ValueError("Gradient clipping in the optimizer "
    ValueError: Gradient clipping in the optimizer (by setting clipnorm or clipvalue) is currently unsupported when using a distribution strategy.

Fix: Don't set the clipnorm parameter, i.e. remove the following argument from the tf.keras.optimizers.Adam( call:

    clipnorm=training_args.max_grad_norm,
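Rather than deleting the argument outright, one option is to pass clipnorm only when no distribution strategy is in use. The following is a minimal sketch of that idea, not the change made in run_glue.py: build_adam_kwargs and its uses_distribution_strategy parameter are hypothetical names introduced here for illustration, and the sketch assumes a TensorFlow version (such as the 2.3.0 reported above) that rejects clipnorm under a distribution strategy.

```python
def build_adam_kwargs(learning_rate, max_grad_norm=None, uses_distribution_strategy=True):
    """Collect keyword arguments for tf.keras.optimizers.Adam (hypothetical helper)."""
    kwargs = {"learning_rate": learning_rate}
    # Only forward clipnorm when no distribution strategy is active, because
    # the TF version in this report raises a ValueError for the combination.
    if max_grad_norm is not None and not uses_distribution_strategy:
        kwargs["clipnorm"] = max_grad_norm
    return kwargs


kwargs = build_adam_kwargs(5e-5, max_grad_norm=1.0, uses_distribution_strategy=True)
print(kwargs)
# The dictionary would then be unpacked as tf.keras.optimizers.Adam(**kwargs).
```

This keeps gradient clipping available for setups where it is supported, at the cost of silently training without clipping in the other case.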

Error 2

    ValueError: in user code:

    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:806 train_function *
        return step_function(self, iterator)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:796 step_function
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/distribute/one_device_strategy.py:184 run
        return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:1211 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py:2585 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/distribute/one_device_strategy.py:367 _call_for_each_replica
        return fn(*args, **kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:789 run_step
        outputs = model.train_step(data)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:748 train_step
        loss = self.compiled_loss(
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/engine/compile_utils.py:204 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/losses.py:149 __call__
        losses = ag_call(y_true, y_pred)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/losses.py:253 call
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/losses.py:1566 sparse_categorical_crossentropy
        return K.sparse_categorical_crossentropy(
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/keras/backend.py:4790 sparse_categorical_crossentropy
        return array_ops.reshape(res, output_shape[:-1])
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py:201 wrapper
        return target(*args, **kwargs)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py:195 reshape
        result = gen_array_ops.reshape(tensor, shape, name)
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/ops/gen_array_ops.py:8233 reshape
        _, _, _op, _outputs = _op_def_library._apply_op_helper(
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:742 _apply_op_helper
        op = g._create_op_internal(op_type_name, inputs, dtypes=None,
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py:591 _create_op_internal
        return super(FuncGraph, self)._create_op_internal(  # pylint: disable=protected-access
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:3477 _create_op_internal
        ret = Operation(
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:1974 __init__
        self._c_op = _create_c_op(self._graph, node_def, inputs,
    /anaconda/envs/azureml_py38/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:1815 _create_c_op
        raise ValueError(str(e))

    ValueError: Dimension size must be evenly divisible by 192 but is 8 for '{{node sparse_categorical_crossentropy_2/Reshape_2}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32](sparse_categorical_crossentropy_2/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits, sparse_categorical_crossentropy_2/strided_slice_1)' with input shapes: [8], [4] and with input tensors computed as partial shapes: input[1] = [2,8,12,?].

Fix: It looks like the call to TFGPT2ForSequenceClassification returns logits of shape (batch_size, sequence_length, num_labels), which is causing the above error.

After pooled_logits are computed, add the following line to extract the logits from the last step of the sequence:

    pooled_logits = pooled_logits[:, -1, :]

and change

    return TFSequenceClassifierOutputWithPast(
        loss=loss,
        logits=pooled_logits,
        past_key_values=transformer_outputs.past_key_values,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
    )

to

    return TFSequenceClassifierOutputWithPast(
        logits=pooled_logits,
    )
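For readers unfamiliar with the slice used in this fix, the sketch below shows what [:, -1, :] does to a rank-3 array, using plain nested lists in place of tensors. It assumes the logits really have shape (batch_size, sequence_length, num_labels) as described above; last_step is a hypothetical helper and the numbers are made up.

```python
def last_step(logits):
    """Plain-Python equivalent of logits[:, -1, :] for nested lists
    shaped (batch_size, sequence_length, num_labels)."""
    return [sequence[-1] for sequence in logits]


# Two sequences, three steps each, two labels per step (made-up values).
batch = [
    [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]],
    [[0.3, 0.7], [0.5, 0.5], [0.2, 0.8]],
]

pooled = last_step(batch)
print(pooled)  # [[0.8, 0.2], [0.2, 0.8]]
```

The result has one row of num_labels scores per example, which is the shape a sparse categorical cross-entropy loss expects alongside one integer label per example.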

Expected behavior

Successful completion of training and evaluation

patrickvonplaten commented 3 years ago

Hey @bpraveenk,

could you attach a google colab to reproduce the error here? Pinging @Rocketknight1 for TF here.

Rocketknight1 commented 3 years ago

TF maintainer here! I reproduced the second error but not the first - but the second one seems like a much more serious problem anyway. The problem does not occur for me in any other models I tested except GPT2, but possibly there are other CLM models where this occurs. My suspicion is that the bug is in our TFGPT2ForSequenceClassification code, not in run_glue.py. Although you can write some code in run_glue.py to work around it, this might break other models that are currently working, like BERT.

Either way, thank you for finding this! If you want to try to fix this yourself, please let me know, and ask any questions you like. Please make sure that any fixes you submit also work with MLM models like bert-base-uncased as well as gpt2 though!

Rocketknight1 commented 3 years ago

Hey, actually, on further examination, I think the issue is that all CLM-trained models return outputs with past states. Therefore, all we need to do is check whether the output is an instance of TFSequenceClassifierOutputWithPast, in run_glue.py, and if so, to take pooled_logits[:, -1, :] as you suggested, and we shouldn't need to modify the GPT-2 code at all.

bpraveenk commented 3 years ago

Thank you @Rocketknight1 for your prompt response. I am glad I could help!

After going over this tensorflow issue, I guess the first error was probably resolved in a later version of tensorflow-2.x. Could you share the version of tf that you are using to reproduce the error?

Regarding error 2, adding pooled_logits = pooled_logits[:, -1, :] alone did not work for me. I had to remove the past states (see below) from the return object for training to proceed successfully. I recommend running the code in tensorflow eager mode to see a more descriptive error. The change I made is specific to the GPT2 classification model and it didn't affect fine-tuning/training other models, e.g., bert-base-uncased, which I used to test the change.

    return TFSequenceClassifierOutputWithPast(
        logits=pooled_logits,
    )

Just curious, would your proposed solution to check the instance of the output (e.g., TFSequenceClassifierOutputWithPast) in run_glue.py work with model.fit? Since the change is in run_glue.py, perhaps we should test the solution to make sure it works with other models too.

On a related note, what are your thoughts on using a flag to control the inclusion of past-states and loss in the GPT2Classification model forward-pass output?

I am happy to fix the bug. Could you please point me to the document that describes how to run the relevant unit tests, submit a patch, and get it reviewed by the maintainers before it's merged?

Rocketknight1 commented 3 years ago

Hi @bpraveenk! I was using TF 2.5, which might explain why I didn't see the first error.

However, you're correct that the fix I suggested won't work with model.fit, so we would need some way to get CLM models to stop returning those past states. I'm going to check with the rest of the team about whether returning TFSequenceClassifierOutputWithPast is intended in this case, and what we can do about it. If we decide a flag like you suggested is appropriate, I'd be happy to work with you on implementing that.

Also, this isn't really relevant, but can I ask why you want to use a CLM model like GPT-2 for sequence classification instead of a more normal MLM model? It's definitely something we should be supporting, but it's still quite rare, so I'm curious to know what your use-case is there!

bpraveenk commented 3 years ago

Thank you @Rocketknight1 for your detailed response. I was curious to benchmark the performance of GPT2 against other LMs on classification tasks.

Rocketknight1 commented 3 years ago

That's interesting - my intuition is that it will do worse than MLMs, though it has the advantage of being quite a large model. That said, we're adding some equally-big MLM models to the hub, including a TF port of DeBERTaV2 in the next few days, which would be an interesting point of comparison. I'd love to see your benchmark results when they're ready!

bpraveenk commented 3 years ago

It's indeed exciting to hear that large MLM models will be made available! To compare the performance of discriminative and generative models, I am planning to use the BART (encoder-decoder) model as well. Do I have to write custom code to fine-tune a BART model on GLUE tasks, or can I use run_glue.py?

Rocketknight1 commented 3 years ago

BART is a Seq2Seq model, and I'm not sure if we have a TF implementation of a sequence classifier head for it, unfortunately. You might have to build your own model, starting from TFBartModel and then adding a classifier head on top.

Clara-breado commented 3 years ago

It seems that passing the pad_token_id works too? I ran into the same problem today when I wanted to build a classifier head on top of TFGPT2Model. I tried to follow the source code in modeling_tf_gpt2.py and add a dense layer after the transformer (gpt2 in this case), but I forgot this step: in_logits = tf.gather(logits, sequence_lengths, batch_dims=1, axis=1). When I called fit, the bug occurred (shape mismatch). Thanks @bpraveenk @Rocketknight1, you did me a big favor by fixing the bug. Now I use dense()[:,-1,:] instead of dense() as the output, and the model can fit. But I still wonder why TFAutoModelForSequenceClassification and my model (TFGPT2Model + dense()[:,-1,:], where I copied the 'score' parameters from modeling_tf_gpt2.py) give different outputs. Is it because of the weights, which I haven't trained? (But I guess the weights of score haven't been trained in TFAutoModelForSequenceClassification either...)

my custom model:

    from tensorflow.keras.layers import Dense
    import tensorflow as tf

    input_ids = tf.keras.layers.Input(shape=(128,), name='input_ids', dtype='int32')
    attention_mask = tf.keras.layers.Input(shape=(128,), name='attention_mask', dtype='int32')
    embeddings = gpt2_hf(input_ids=input_ids, attention_mask=attention_mask)[0]

    score = tf.keras.layers.Dense(
        112,
        kernel_initializer=tf.initializers.TruncatedNormal(config.initializer_range),
        name="score",
        use_bias=False,
    )(embeddings)[:, -1, :]

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=score, name='GPT2_Multiclass')

pad token id:

    if self.config.pad_token_id is None:
        sequence_lengths = -1
    else:
        if inputs["input_ids"] is not None:
            sequence_lengths = (
                tf.reduce_sum(
                    tf.cast(
                        tf.math.not_equal(inputs["input_ids"], self.config.pad_token_id),
                        dtype=inputs["input_ids"].dtype,
                    ),
                    -1,
                    keepdims=False,
                )
                - 1
            )
            in_logits = tf.gather(logits, sequence_lengths, batch_dims=1, axis=1)
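The pad-token logic quoted above can be traced by hand with a framework-free sketch. The version below mirrors the idea using plain lists: count the non-pad tokens in each row, subtract one to get the index of the last real token, and pick the logits at that position. It assumes right-padded inputs; last_token_logits is a hypothetical helper, and the token ids, pad id 0, and logit values are made up for illustration.

```python
def last_token_logits(logits, input_ids, pad_token_id=None):
    """Pick, per example, the logits at the last non-pad position.

    logits:    nested lists shaped (batch_size, sequence_length, num_labels)
    input_ids: nested lists shaped (batch_size, sequence_length), right-padded
    """
    pooled = []
    for token_logits, ids in zip(logits, input_ids):
        if pad_token_id is None:
            # Without a pad token, fall back to the final position.
            index = -1
        else:
            # Number of non-pad tokens minus one = index of the last real token.
            index = sum(1 for token in ids if token != pad_token_id) - 1
        pooled.append(token_logits[index])
    return pooled


input_ids = [[5, 6, 7, 0], [8, 9, 0, 0]]           # 0 stands in for the pad id
logits = [
    [[1.0], [2.0], [3.0], [4.0]],
    [[5.0], [6.0], [7.0], [8.0]],
]

print(last_token_logits(logits, input_ids, pad_token_id=0))  # [[3.0], [6.0]]
print(last_token_logits(logits, input_ids))                  # [[4.0], [8.0]]
```

The second call shows why plain [:, -1, :] slicing and the pad-aware gather can disagree: with padded batches, the final position may hold the logits of a padding token rather than of the last real token.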

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

huggingface / transformers

GPT2 for classification - Errors encountered while running run_glue.py and (possible) fixes #13288
