+1. I'm also confused about how to structure the lm_labels and the decoder_input_ids.
Given T5's universal text-to-text objective, I'm under the impression that the T5 summarization example should be applicable for all T5 tasks, as long as the input and target sequences are correctly structured for the specified task. Hope this can be confirmed!
Sample input and target structures for specific tasks can be found at Appendix D in the T5 paper.
To correctly train T5 one should follow the instructions at https://huggingface.co/transformers/model_doc/t5.html#training .
For training, there is no need to provide the decoder_input_ids - they are created automatically. One only has to provide the lm_labels.
As @enzoampil mentioned, Appendix D of the paper gives good input/output examples.
@patrickvonplaten What exactly would be the lm_labels for something like summarization?
Example use case:
Text: "ABC" with maximum length 500
Summary: "XYZ" with maximum length 50
I understand that we can prepare input_ids and attention_mask like this for the document:
x = tokenizer.encode_plus(sentence,
                          max_length=500,
                          pad_to_max_length=True,
                          return_tensors='pt')
Now for the lm_labels, i.e. the summary, is simply doing this enough?
lm_labels = tokenizer.encode(summary,
                             return_tensors='pt',
                             max_length=50,
                             pad_to_max_length=True)
And then call the model as:
model = T5ForConditionalGeneration.from_pretrained('t5-small')
model(input_ids=..., lm_labels=lm_labels, attention_mask=...)
In your examples folder for summarization, I've seen some preprocessing like this for lm_labels. I didn't understand why this is being done.
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone()
lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
Hi @amitness,
For T5 summarization you will have to prepend the prefix "summarize: " to every input. But you are more or less right. All you have to do is:
x = tokenizer.encode_plus("summarize: " + sentence,
max_length=500,
pad_to_max_length=True,
return_tensors='pt')
lm_labels = tokenizer.encode_plus(summary,
return_tensors='pt',
max_length=50,
pad_to_max_length=True)
lm_labels[lm_labels == tokenizer.pad_token_id] = -100
There is no need to shift the tokens as you show at the end of your comment because T5 does that automatically - see https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_t5.py#L1063.
This is also explained in https://huggingface.co/transformers/model_doc/t5.html#training .
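Putting the pieces together, here is a minimal single-step training sketch. It follows the snippets above and uses the lm_labels keyword of the transformers version discussed in this thread (newer versions call this argument labels); the sentence and summary strings are placeholders.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

sentence = "ABC ..."  # placeholder document text
summary = "XYZ ..."   # placeholder summary text

# encode the source document with the task prefix
x = tokenizer.encode_plus("summarize: " + sentence,
                          max_length=500,
                          pad_to_max_length=True,
                          return_tensors='pt')

# encode the target and mask out padding so it is ignored by the loss
lm_labels = tokenizer.encode(summary,
                             return_tensors='pt',
                             max_length=50,
                             pad_to_max_length=True)
lm_labels[lm_labels == tokenizer.pad_token_id] = -100

# T5 builds the decoder_input_ids from lm_labels internally
outputs = model(input_ids=x['input_ids'],
                attention_mask=x['attention_mask'],
                lm_labels=lm_labels)
loss = outputs[0]
loss.backward()
```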
Thanks for this clarification @patrickvonplaten ! Finally got it to work from my side 😄
The gotcha for me was that the decoder_input_ids at inference should be prepended with the padding token, as stated in the docs for T5ForConditionalGeneration.
@enzoampil Can you give an example code of what you meant by prepending padding token at inference time?
@patrickvonplaten Thank you.
Besides the built-in prefixes like summarize:, translate:, etc., can I train with my own prefix? Let's say there is a prefix called "simplify:" and I have a dataset of pairs. Is adding the prefix and preparing the data in the format you mentioned above enough?
@amitness
E.g. in your summarization case, it would look something like:
from transformers import T5Tokenizer, T5Model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")
decoder_input_ids = tokenizer.encode("<pad>", return_tensors="pt")
outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
outputs[0]
Do note that T5ForConditionalGeneration already prepends the padding token by default. The above is only necessary if you're doing a forward pass straight from T5Model.
Regarding your question about making your own prefix, yes, you should be able to train on your own prefix. This is the whole point of T5's text-to-text approach. You should be able to specify any problem through this kind of approach (e.g. Appendix D in the T5 paper).
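For the "simplify:" case asked about above, data preparation could look something like the sketch below. It reuses the recipe from earlier in the thread; the sentence pair is made up for illustration, only the prefix string changes.

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

# hypothetical (complex, simple) sentence pairs for a "simplify: " task
pairs = [
    ("The feline reposed upon the woven rug.", "The cat lay on the rug."),
]

for source, target in pairs:
    x = tokenizer.encode_plus("simplify: " + source,
                              max_length=500,
                              pad_to_max_length=True,
                              return_tensors='pt')
    lm_labels = tokenizer.encode(target,
                                 return_tensors='pt',
                                 max_length=50,
                                 pad_to_max_length=True)
    lm_labels[lm_labels == tokenizer.pad_token_id] = -100
    # feed x['input_ids'], x['attention_mask'] and lm_labels to the model as shown above
```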
@enzoampil Makes sense. Thank you so much.
> @patrickvonplaten Thank you.
> Besides the built-in prefixes like summarize:, translate:, etc., can I train with my own prefix? Let's say there is a prefix called "simplify:" and I have a dataset of pairs. Is adding the prefix and preparing the data in the format you mentioned above enough?
Sure, you can train with your own prefix.
> Thanks for this clarification @patrickvonplaten ! Finally got it to work from my side 😄
> The gotcha for me was that the decoder_input_ids at inference should be prepended with the padding token, as stated in the docs for T5ForConditionalGeneration.
Yeah, that's actually a bit hidden in the code. So to clarify:
- During training, there is no need to prepend the padding token, since this is done automatically in T5 when lm_labels is provided.
- During evaluation, one has to prepend the PAD token, as you stated in your example.
- After training, the model can be used with the generate() method (which actually powers the summarization, translation and text-generation pipelines). In the generate() method, the padding token is automatically prepended.
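For example, a rough inference sketch with generate() could look like this (model, tokenizer and generation hyperparameters are illustrative, following the snippets above):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")

# generate() prepends the decoder start (pad) token itself,
# so no decoder_input_ids need to be passed here
summary_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```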
@patrickvonplaten One thing I've noticed is the discrepancy between huggingface's and the original google-research tokenization.
In the official colab by the paper authors, they seem to add </s> to the end of each text when tokenizing. But when we use the tokenizers from Hugging Face, it is not added. Not sure if it is a problem or not.
Here is an excerpt from their official colab
'inputs_plaintext': b'trivia question: what is the population of fayetteville north carolina?', 'inputs': array([22377, 822, 10, 125, 19, 8, 2074, 13, 3,
89, 9, 63, 1954, 1420, 3457, 443, 12057, 9,
58, 1])
You can see 1 added at the end of the token ids. But if we tokenize this same sentence with the huggingface tokenizer, we don't get 1 at the end.
tokenizer.encode('trivia question: what is the population of fayetteville north carolina?')
# [22377, 822, 10, 125, 19, 8, 2074, 13, 3, 89, 9, 63, 1954, 1420, 3457, 443, 12057, 9, 58]
When I was prototyping with the models, I tried preparing data like this to solve it. This adds 1 to the end. Not sure if we need to do this or not.
tokenizer.encode("summarize: Hello world</s>", return_tensors="pt")
Yes you are right, you should add the </s> token to the end of a sentence. I think this is also shown in the docs: https://huggingface.co/transformers/model_doc/t5.html#training.
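If you want to guard against this, one small sketch is to append the EOS id only when the tokenizer did not already add it (behaviour differs between tokenizer versions, so this is just a defensive check, not the official recipe):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

ids = tokenizer.encode("summarize: Hello world")
# older tokenizer versions do not append </s> (id 1) automatically
if ids[-1] != tokenizer.eos_token_id:
    ids.append(tokenizer.eos_token_id)
```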
Thanks to @patrickvonplaten for all clarification and others for their further questions that led to more details on the subject.
Hello everyone,
I am currently working on fine-tuning the TFT5ForConditionalGeneration model on a parallel dataset. Questions:
- Can I simply call model.fit([x, y]), where x is input_ids and y is lm_labels?
- If not, how do I pass in lm_labels and train the model?
Thanks.
@patrickvonplaten
For the tensorflow version you have to input input_ids, decoder_input_ids and lm_labels yourself. The model should work fine with the keras framework!
I will soon add more documentation for T5 for tensorflow. It's true that there is not enough documentation for TF at the moment.
Okay, I would appreciate that. So, do I add the input_ids, decoder_input_ids and lm_labels as keyword arguments when calling model.fit (which I doubt), or where do I do that?
I have not looked into training with the keras model.fit function yet. The forward pass in tensorflow's T5 implementation needs both input_ids and decoder_input_ids, as you can see when going through this function:
https://github.com/huggingface/transformers/blob/fd2174664c8879c747ada3e6e0a2486858808421/src/transformers/modeling_tf_t5.py#L980
So, depending on your code, you will have to create input_ids, decoder_input_ids and lm_labels yourself. Feel free to share your code here if you have a working training pipeline for TFT5 :-)
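Not an official recipe, but one way to wire up the tensors mentioned above is a manual training step with GradientTape, roughly like this. The example strings, learning rate and the right-shift by hand are illustrative; the labels tensor plays the role of lm_labels.

```python
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# toy pair; in practice build a tf.data.Dataset of such tensors
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf")
labels = tokenizer.encode("My dog is cute", return_tensors="tf")

# shift the labels right by hand: the pad token acts as the decoder start token
pad_id = tokenizer.pad_token_id
decoder_input_ids = tf.concat([tf.fill((1, 1), pad_id), labels[:, :-1]], axis=-1)

with tf.GradientTape() as tape:
    outputs = model(input_ids, decoder_input_ids=decoder_input_ids)
    logits = outputs[0]            # (batch_size, seq_len, vocab_size)
    loss = loss_fn(labels, logits)

grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```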
Hi Patrick. Got it to work with Pytorch. However, I have a question:
Is it possible to use a different vocab size with this pretrained model? I have a trained sentence piece model and it only works with this pretrained t5 when I use a beam size of 1. I have manually changed the vocab size by setting model.config.vocab_size = tokenizer.vocab_size. However, the beam size problem still persists and it returns a shape mismatch error.
Please let me know if this is possible, thanks.
@patrickvonplaten
I think it will work in case the target pieces from the new vocab are the same as in the old one. Besides, what is the benefit of the pretrained T5 if the sentencepiece targets changed?!
Created a little repo for NMT finetuning https://github.com/keleog/finetune_huggingace_t5
> I have not looked into training with the keras model.fit function yet. The forward pass in tensorflow's T5 implementation needs both input_ids and decoder_input_ids, as you can see when going through this function: https://github.com/huggingface/transformers/blob/fd2174664c8879c747ada3e6e0a2486858808421/src/transformers/modeling_tf_t5.py#L980 So, depending on your code, you will have to create input_ids, decoder_input_ids and lm_labels yourself. Feel free to share your code here if you have a working training pipeline for TFT5 :-)
Hi @patrickvonplaten, I was able to create a data source with the input data and labels as you described.
Now I'm trying to use that data with keras fit and the loss function tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True). The shape of the labels is (batch_size, seq_len), and I would expect the model TFT5ForConditionalGeneration to return logits of shape (batch_size, seq_len, vocab_size). However, its call method returns this:
return decoder_outputs + encoder_outputs
so I get an error:
ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 27 array(s), for inputs ['output_1', 'output_2', 'output_3', 'output_4', 'output_5', 'output_6', 'output_7', 'output_8', 'output_9', 'output_10', 'output_11', 'output_12', 'output_13', 'output_14', 'output_15', 'output_16', 'output_17', 'output_18', 'output_19', 'output_20', 'output_21', 'output_22', 'output_23', 'output_24', 'output_25', 'output_26', 'output_27'] but instead got the following list of 1 arrays: [<tf.Tensor 'args_4:0' shape=(32, 128) dtype=int32>]...
I can think of two solutions, neither of which sounds good:
- override the call method in a subclass and return only the decoder outputs

What would you advise?
Hi @patrickvonplaten, I am working on a question answering task using TFT5. I have done the text encoding step. My raw input is the question and the target is the answer.
How should I configure the input so that I can pass it to the model.fit() method like this? I am able to get the input ids and the input mask.
from tensorflow import keras
from transformers import TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = keras.optimizers.Adam(lr=5e-5)
model.compile(optimizer=optimizer)
model.fit(
    x_train,
    y_train,
    epochs=1,
    verbose=2,
    batch_size=2,
)
Here is the Colab Notebook
I'll start working on a TFT5 notebook this week. Related issues: https://discuss.huggingface.co/t/how-to-train-tft5forconditionalgeneration-model/888 https://discuss.huggingface.co/t/how-to-train-t5-with-tensorflow/641/6 https://github.com/huggingface/transformers/issues/6876
Hello @patrickvonplaten
I am working with T5 for paraphrase generation, but I wanted to know if there is a way to use my own custom defined loss function for training?
Sure! You can just output the language head logits with T5 and build your own loss with it :-)
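A rough sketch of what that could look like for paraphrase generation, assuming the lm_labels keyword used earlier in this thread (newer versions use labels) and a made-up "paraphrase: " prefix and sentence pair:

```python
import torch.nn.functional as F
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# illustrative input/target pair for a paraphrasing task
input_ids = tokenizer.encode("paraphrase: The weather is nice today.", return_tensors="pt")
labels = tokenizer.encode("Today the weather is pleasant.", return_tensors="pt")

# pass lm_labels so the decoder inputs are built for us, but ignore the built-in loss
outputs = model(input_ids=input_ids, lm_labels=labels)
logits = outputs[1]  # (batch_size, seq_len, vocab_size); outputs[0] is the default loss

# example: plain token-level cross-entropy; swap in any custom objective here
custom_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
custom_loss.backward()
```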
Hello @patrickvonplaten, I am also new to T5 fine-tuning for summarization tasks. I read the online tutorial from this website: https://shivanandroy.com/fine-tune-t5-transformer-with-pytorch/
I do not know why, for the decoder_input_ids and labels, the author removed the last and the first token ID respectively. See the following code:
for _, data in enumerate(loader, 0):
    y = data["target_ids"].to(device, dtype=torch.long)
    y_ids = y[:, :-1].contiguous()
    lm_labels = y[:, 1:].clone().detach()
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
    ids = data["source_ids"].to(device, dtype=torch.long)
    mask = data["source_mask"].to(device, dtype=torch.long)
    outputs = model(
        input_ids=ids,
        attention_mask=mask,
        decoder_input_ids=y_ids,
        labels=lm_labels,
    )
Would it be possible to ask you about this? I googled it and read the original paper and the Hugging Face documentation, but I still do not understand.
Hey @tom192180,
Could you maybe try to use the forum for this: https://discuss.huggingface.co/ - thanks!
> Hey @tom192180,
> Could you maybe try to use the forum for this: https://discuss.huggingface.co/ - thanks!
Hey @patrickvonplaten! I posted it to the forum. Thank you for pointing me there!
Hello @patrickvonplaten, I am new to NLP and want to use the T5-base model for causal language modeling.
My goal is to fine-tune the t5-base model on a specific corpus with a causal language modeling objective. I found this document and it uses AutoModelForCausalLM, but that class just does not include the T5 series of models.
So my question is: how should I fine-tune the t5 model for the CLM objective? In my understanding, CLM is the process of predicting token_2 from token_1, then token_3 from token_1, token_2, and so on until the end of the input sequence, so I am confused about how to set this up myself.
I tried to split each of my training examples into something like this (ti == token_i, 1 == eos_token):

input_ids                      labels
[t1, 1, 1, 1, 1, 1, ...]       [t1, t2, 1, 1, 1, 1, ...]
[t1, t2, 1, 1, 1, 1, ...]      [t1, t2, t3, 1, 1, 1, ...]
[t1, t2, t3, 1, 1, 1, ...]     [t1, t2, t3, t4, 1, 1, ...]
[t1, t2, t3, t4, 1, 1, ...]    [t1, t2, t3, t4, t5, 1, ...]
The first problem is obvious: the expanded dataset is too large and requires more time to fine-tune. The second problem is that this seems strange, and I don't know if it actually fulfills the CLM objective. This is the only idea I can come up with to solve this problem; does it work?
I posted it to the forum but received nothing; would it be possible to ask you?
Hi, I want to fine-tune T5 for a seq2seq task and I'm using T5ForConditionalGeneration as it seems to have an LM decoder on top. As there's no code example for this, I have lots of questions:
- How should I pass the forward inputs in the training phase? I read this explanation over and over again and I don't understand whether I should just use input_ids and lm_labels for the training or not. Also, somewhere in this issue someone has mentioned otherwise, so which one is right? I'm super confused.