huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Fine-tuning T5 model #4092

Closed Palipoor closed 4 years ago

Palipoor commented 4 years ago

Hi, I want to fine-tune T5 for a seq2seq task and I'm using the T5ForConditionalGeneration as it seems to have an LM decoder on top. As there's no code example for this, I have lots of questions:

  1. Am I doing the right thing?
  2. I'm using the Adam optimizer. Is it ok?
  3. I'm a bit confused about the forward inputs in the training phase. I read this explanation over and over again and I don't understand whether I should just use input_ids and lm_labels for the training or not. Also somewhere in this issue someone's mentioned that:

    T5 input sequence should be formatted with [CLS] and [SEP] tokens

So which one is right? I'm super confused.

amitness commented 4 years ago

+1. I'm also confused about how to structure the lm_labels and the decoder_input_ids.

enzoampil commented 4 years ago

Given T5's universal text-to-text objective, I'm under the impression that the T5 summarization example should be applicable for all T5 tasks, as long as the input and target sequences are correctly structured for the specified task. Hope this can be confirmed!

Sample input and target structures for specific tasks can be found at Appendix D in the T5 paper.

patrickvonplaten commented 4 years ago

To correctly train T5 one should follow the instructions at https://huggingface.co/transformers/model_doc/t5.html#training .

For training, there is no need to provide the decoder_input_ids - they are created automatically. One only has to provide the lm_labels.

As @enzoampil mentioned, Appendix D of the paper gives good input/output examples.

amitness commented 4 years ago

@patrickvonplaten What exactly would be the lm_labels for something like summarization?

Example use case:
Text: "ABC" with maximum length 500
Summary: "XYZ" with maximum length 50

I understand that we can prepare input_ids and attention_mask like this for the document.

x = tokenizer.encode_plus(sentence, 
                          max_length=500, 
                          pad_to_max_length=True, 
                          return_tensors='pt')

Now for the lm_labels, i.e. the summary, is simply doing this enough?

lm_labels = tokenizer.encode(summary,  
                            return_tensors='pt', 
                            max_length=50, 
                            pad_to_max_length=True)

And the model as

model = T5ForConditionalGeneration.from_pretrained('t5-small')
model(input_ids=..., lm_labels=lm_labels, attention_mask=...)

In your examples folder for summarization, I've seen some preprocessing like this for lm_labels. I didn't understand why this is being done.

y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100

patrickvonplaten commented 4 years ago

Hi @amitness,

For T5 summarization you have to prepend the prefix "summarize: " to every input. But you are more or less right. All you have to do is:

  1. Prepare input data
    x = tokenizer.encode_plus("summarize: " + sentence,
                              max_length=500,
                              pad_to_max_length=True,
                              return_tensors='pt')
  2. Prepare labels
    lm_labels = tokenizer.encode(summary,
                                 return_tensors='pt',
                                 max_length=50,
                                 pad_to_max_length=True)
  3. For tokens that are padded (which is only relevant if you train with batch_size > 1) you need to make sure that no loss is calculated on those tokens, so
    lm_labels[lm_labels == tokenizer.pad_token_id] = -100

There is no need to shift the tokens as you show at the end of your comment because T5 does that automatically - see https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_t5.py#L1063.

This is also explained in https://huggingface.co/transformers/model_doc/t5.html#training .
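
Putting the three steps together, a minimal training-step sketch could look like this (t5-small is just an example checkpoint, sentence and summary are placeholder strings, and lm_labels is the argument name used by the library version discussed in this thread):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

sentence = "..."  # placeholder: the document to summarize
summary = "..."   # placeholder: the reference summary

# 1. prepare the input with the task prefix
x = tokenizer.encode_plus("summarize: " + sentence,
                          max_length=500,
                          pad_to_max_length=True,
                          return_tensors='pt')

# 2. prepare the labels
lm_labels = tokenizer.encode(summary,
                             return_tensors='pt',
                             max_length=50,
                             pad_to_max_length=True)

# 3. make sure no loss is computed on padded label positions
lm_labels[lm_labels == tokenizer.pad_token_id] = -100

# forward pass; with lm_labels given, the first output is the loss
outputs = model(input_ids=x['input_ids'],
                attention_mask=x['attention_mask'],
                lm_labels=lm_labels)
loss = outputs[0]
loss.backward()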

enzoampil commented 4 years ago

Thanks for this clarification @patrickvonplaten ! Finally got it to work from my side 😄

Gotcha for me was that the decoder_input_ids at inference should be prepended by the padding token as stated in the docs for T5ForConditionalGeneration.

amitness commented 4 years ago

@enzoampil Can you give an example code of what you meant by prepending padding token at inference time?

amitness commented 4 years ago

@patrickvonplaten Thank you.

Besides the built-in prefixes like summarize:, translate:, etc., can I train with my own prefix? Let's say there is a prefix called "simplify:" and I have a dataset of input/target pairs. Is adding the prefix and preparing data in the format you mentioned above enough?

enzoampil commented 4 years ago

@amitness

E.g. in your summarization case, it would look something like:

from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")
decoder_input_ids = tokenizer.encode("<pad>", return_tensors="pt") 
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
outputs[0]

Do note that T5ForConditionalGeneration already prepends the padding by default. Above is only necessary if you're doing a forward pass straight from T5Model.

Regarding your question about making your own prefix, yes, you should be able to train on your own prefix. This is the whole point of T5's text-to-text approach. You should be able to specify any problem through this kind of approach (e.g. Appendix D in the T5 paper).
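
For example, data for a hypothetical simplify: task could be prepared exactly like the summarization case, just with a different prefix (a sketch; complex_sentence and simple_sentence are made-up names, and model here would be a T5ForConditionalGeneration):

input_ids = tokenizer.encode("simplify: " + complex_sentence, return_tensors="pt")
lm_labels = tokenizer.encode(simple_sentence, return_tensors="pt")
loss = model(input_ids=input_ids, lm_labels=lm_labels)[0]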

amitness commented 4 years ago

@enzoampil Makes sense. Thank you so much.

patrickvonplaten commented 4 years ago

@patrickvonplaten Thank you.

Besides the built-in prefixes like summarize:, translate:, etc., can I train with my own prefix? Let's say there is a prefix called "simplify:" and I have a dataset of input/target pairs. Is adding the prefix and preparing data in the format you mentioned above enough?

Sure, you can train with your own prefix.

patrickvonplaten commented 4 years ago

Thanks for this clarification @patrickvonplaten ! Finally got it to work from my side

Gotcha for me was that the decoder_input_ids at inference should be prepended by the padding token as stated in the docs for T5ForConditionalGeneration.

Yeah that's actually a bit hidden in the code. So to clarify: During training, there is no need to prepend the padding token since this is done automatically in T5 when lm_labels is provided. During evaluation, one has to prepend the PAD token as you stated in your example.

After training, the model can be used with the generate() method (which actually powers the summarization, translation and text-generation pipelines). In the generate() method, the padding token is automatically prepended.
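
A rough sketch of inference with a fine-tuned model (article_text is a placeholder, and the generation parameters are only examples):

input_ids = tokenizer.encode("summarize: " + article_text, return_tensors="pt")

# generate() prepends the decoder start (padding) token internally
summary_ids = model.generate(input_ids, max_length=50, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))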

amitness commented 4 years ago

@patrickvonplaten One thing I've noticed is the discrepancy between huggingface's and the original google-research tokenization.

In the official colab by the paper authors, they seem to add </s> to the end of each text when tokenizing. But when we use the Hugging Face tokenizers, it is not added. Not sure if it is a problem or not. Here is an excerpt from their official colab:

'inputs_plaintext': b'trivia question: what is the population of fayetteville north carolina?',
'inputs': array([22377,   822,    10,   125,    19,     8,  2074,    13,     3,
                    89,     9,    63,  1954,  1420,  3457,   443, 12057,     9,
                    58,     1])

You can see 1 added at the end of the token_ids. But if we tokenize this same sentence with the Hugging Face tokenizer, we don't get 1 at the end.

tokenizer.encode('trivia question: what is the population of fayetteville north carolina?')
# [22377,   822,    10,   125,    19,     8,  2074,    13,     3, 89,     9,    63,  1954,  1420,  3457,   443, 12057,     9, 58]

When I was prototyping with the models, I tried preparing data like this to solve it. This adds 1 to the end. Not sure if we need to do this or not.

tokenizer.encode("summarize: Hello world</s>", return_tensors="pt")

patrickvonplaten commented 4 years ago

Yes you are right, you should add the </s> token to the end of a sentence. I think this is also shown in the docs: https://huggingface.co/transformers/model_doc/t5.html#training.

Palipoor commented 4 years ago

Thanks to @patrickvonplaten for all the clarification, and to the others for the further questions that led to more details on the subject.

keleog commented 4 years ago

Hello everyone,

I am currently working on finetuning the TFT5ForConditionalGeneration model on a parallel dataset. Questions:

  1. Can I call model.fit like this - model.fit([x,y]) where x is input_ids and y is lm_labels?

If not, how do I pass in lm_labels and train the model?

Thanks.

keleog commented 4 years ago

@patrickvonplaten

patrickvonplaten commented 4 years ago

For the TensorFlow version you have to pass input_ids, decoder_input_ids and lm_labels yourself. The model should work fine with the Keras framework!

patrickvonplaten commented 4 years ago

I will soon add more documentation for T5 for tensorflow. It's true that there is not enough documentation for TF at the moment.

keleog commented 4 years ago

Okay, I would appreciate that. So, do I add the input_ids, decoder_input_ids and lm_labels as keyword arguments when calling model.fit (which I doubt), or where do I do that?

patrickvonplaten commented 4 years ago

I have not tried training the TensorFlow version with Keras' model.fit function yet. The forward pass in TensorFlow's T5 implementation needs both input_ids and decoder_input_ids, as you can see when going through this function: https://github.com/huggingface/transformers/blob/fd2174664c8879c747ada3e6e0a2486858808421/src/transformers/modeling_tf_t5.py#L980

So, depending on your code you will have to create input_ids, decoder_input_ids and lm_labels yourself. Feel free to share your code here if you have a working training pipeline for TFT5 :-)
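
As a very rough sketch of what "creating them yourself" could look like: the decoder inputs are the target ids shifted one position to the right, starting from the pad token, mirroring what the PyTorch version does internally (input_ids and target_ids are assumed to be already-tokenized tensors, and model a TFT5ForConditionalGeneration):

import tensorflow as tf

# shift the targets right: prepend the pad token and drop the last position
decoder_input_ids = tf.pad(target_ids[:, :-1],
                           paddings=[[0, 0], [1, 0]],
                           constant_values=tokenizer.pad_token_id)

outputs = model(input_ids, decoder_input_ids=decoder_input_ids)
lm_logits = outputs[0]  # (batch_size, seq_len, vocab_size)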

keleog commented 4 years ago

Hi Patrick. Got it to work with Pytorch. However, I have a question:

Is it possible to use a different vocab size with this pretrained model? I have a trained sentence piece model and it only works with this pretrained t5 when I use a beam size of 1. I have manually changed the vocab size by setting model.config.vocab_size = tokenizer.vocab_size . However, the beam size problem still persists and it returns a shape mismatch error.

Please let me know if this is possible, thanks.

keleog commented 4 years ago

@patrickvonplaten

IslamMohamedMosaad commented 4 years ago

I think it will work only if the target pieces from the new vocab are the same as in the old one. Besides, what is the benefit of the pretrained T5 if the SentencePiece targets change?

keleog commented 4 years ago

Created a little repo for NMT finetuning https://github.com/keleog/finetune_huggingace_t5

artem-spector commented 4 years ago

I have not tried training the TensorFlow version with Keras' model.fit function yet. The forward pass in TensorFlow's T5 implementation needs both input_ids and decoder_input_ids, as you can see when going through this function: https://github.com/huggingface/transformers/blob/fd2174664c8879c747ada3e6e0a2486858808421/src/transformers/modeling_tf_t5.py#L980

So, depending on your code you will have to create input_ids, decoder_input_ids and lm_labels yourself. Feel free to share your code here if you have a working training pipeline for TFT5 :-)

Hi @patrickvonplaten, I was able to create a data source with the input data and labels as you described. Now I'm trying to use that data for Keras fit with the loss function tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True). The shape of the labels is (batch_size, seq_len), and I would expect the model TFT5ForConditionalGeneration to return logits of shape (batch_size, seq_len, vocab_size). However, its call method returns decoder_outputs + encoder_outputs, so I get an error:

ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 27 array(s), for inputs ['output_1', 'output_2', 'output_3', 'output_4', 'output_5', 'output_6', 'output_7', 'output_8', 'output_9', 'output_10', 'output_11', 'output_12', 'output_13', 'output_14', 'output_15', 'output_16', 'output_17', 'output_18', 'output_19', 'output_20', 'output_21', 'output_22', 'output_23', 'output_24', 'output_25', 'output_26', 'output_27'] but instead got the following list of 1 arrays: [<tf.Tensor 'args_4:0' shape=(32, 128) dtype=int32>]...

I can think of two solutions, neither sounds good:

  1. override call method in a subclass and return only the decoder outputs
  2. use a custom loss function that extracts the decoder outputs from the model output

What would you advise?
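
For reference, one way around this is to skip model.fit and write a custom loop with tf.GradientTape; a rough sketch, assuming outputs[0] is the LM logits and that model / tokenizer are already defined:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def train_step(input_ids, decoder_input_ids, labels):
    with tf.GradientTape() as tape:
        outputs = model(input_ids, decoder_input_ids=decoder_input_ids)
        logits = outputs[0]                       # (batch, seq_len, vocab_size)
        per_token_loss = loss_fn(labels, logits)  # (batch, seq_len)
        # average only over non-padded target positions
        mask = tf.cast(tf.not_equal(labels, tokenizer.pad_token_id), per_token_loss.dtype)
        loss = tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss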

bhadreshpsavani commented 4 years ago

Hi @patrickvonplaten, I am working on a question answering task using TFT5. I have done the text encoding step. My raw input is the question and the target is the answer (shown in an image in the original post).

How should I configure the input so that I can pass it to the model.fit() method like this? I am able to get the input IDs and the input mask.

from tensorflow import keras
from transformers import TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = keras.optimizers.Adam(lr=5e-5)
model.compile(optimizer=optimizer)
model.fit(
    x_train,
    y_train,
    epochs=1,  
    verbose=2,
    batch_size=2,
)

Here is the Colab Notebook

patrickvonplaten commented 4 years ago

I'll start working on a TFT5 notebook this week. Related issues:

https://discuss.huggingface.co/t/how-to-train-tft5forconditionalgeneration-model/888
https://discuss.huggingface.co/t/how-to-train-t5-with-tensorflow/641/6
https://github.com/huggingface/transformers/issues/6876

QuantumStatic commented 1 year ago

Hello @patrickvonplaten

I am working with T5 for paraphrase generation, but I wanted to know: is there a way to use my own custom loss function for training?

patrickvonplaten commented 1 year ago

Sure! You can just output the language head logits with T5 and build your own loss with it :-)
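
A rough sketch of that idea with a recent version of the library (model is assumed to be a T5ForConditionalGeneration, and the label-smoothed cross-entropy is just an example of a custom criterion):

import torch.nn.functional as F

# pass labels so decoder inputs are built automatically, then ignore the built-in loss
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits  # (batch_size, seq_len, vocab_size)

# custom criterion over the flattened logits; -100 (padding) positions are ignored
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       labels.reshape(-1),
                       ignore_index=-100,
                       label_smoothing=0.1)
loss.backward()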

tom192180 commented 1 year ago

Hello @patrickvonplaten, I am also new to fine-tuning T5 for summarization tasks. I read an online tutorial: https://shivanandroy.com/fine-tune-t5-transformer-with-pytorch/

I do not know why the author removes the last token ID for decoder_input_ids and the first token ID for labels. See the following code:

for _, data in enumerate(loader, 0):
    y = data["target_ids"].to(device, dtype=torch.long)
    y_ids = y[:, :-1].contiguous()

    lm_labels = y[:, 1:].clone().detach()
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
    ids = data["source_ids"].to(device, dtype=torch.long)
    mask = data["source_mask"].to(device, dtype=torch.long)
    outputs = model(
        input_ids=ids,
        attention_mask=mask,
        decoder_input_ids=y_ids,
        labels=lm_labels,
    )

Would it be possible to ask you? I googled it and read the original paper and the Hugging Face documents, but I still do not understand.

patrickvonplaten commented 1 year ago

Hey @tom192180,

Could you maybe try to use the forum for this: https://discuss.huggingface.co/ - thanks!

tom192180 commented 1 year ago

Hey @tom192180,

Could you maybe try to use the forum for this: https://discuss.huggingface.co/ - thanks!

Hey @patrickvonplaten! I posted it to the forum. Thank you for pointing me there!

nanbeitk commented 1 year ago

Hello @patrickvonplaten, I am new to NLP and want to use the T5-base model for causal language modeling.

My goal is to fine-tune the t5-base model on a specific corpus with a causal language modeling objective. I found this document and it uses AutoModelForCausalLM, but that class does not cover the T5 series of models.

So my question is:

  1. How should I fine-tune the T5 model for a CLM objective? In my understanding, CLM is the process of predicting token_2 from token_1, then token_3 from token_1 and token_2, and so on until the end of the input sequence, so I am confused about how to set this up myself.

  2. I tried to split one of my training examples into input_ids and labels like this (ti == token_i, 1 == eos_token).

I posted this to the forum but received no replies; would it be possible to ask you here?