huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to train a custom seq2seq model with BertModel #4517

Closed chenjunweii closed 4 years ago

chenjunweii commented 4 years ago

How to train a custom seq2seq model with BertModel?

I would like to use a Chinese pretrained model based on BertModel,

so I've tried using the Encoder-Decoder Model, but it seems the Encoder-Decoder Model cannot be used for conditional text generation.

I saw that BartModel seems to be the model I need, but I cannot load pretrained BertModel weights into BartModel.

By the way, could I fine-tune a BartModel for seq2seq with custom data?

Any suggestions? Thanks.

patrickvonplaten commented 4 years ago

Hi @chenjunweii - thanks for your issue! I will take a deeper look at the EncoderDecoder framework at the end of this week and should add a google colab on how to fine-tune it.

flozi00 commented 4 years ago

Using a Bert-to-Bert model for a seq2seq task should work with the simpletransformers library; there is working code for it. But there is one strange thing: the saved model loads the wrong weights. Predicting the same string multiple times works correctly, but loading the model again each time generates a new result every time. @patrickvonplaten

patrickvonplaten commented 4 years ago

Hi @flozi00, could you add a code snippet here that reproduces this bug?

flozi00 commented 4 years ago

Of course, it should be reproducible using this code:

import logging

import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_data = [
    ["one", "1"],
    ["two", "2"],
]

train_df = pd.DataFrame(train_data, columns=["input_text", "target_text"])

eval_data = [
    ["three", "3"],
    ["four", "4"],
]

eval_df = pd.DataFrame(eval_data, columns=["input_text", "target_text"])

model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 10,
    "train_batch_size": 2,
    "num_train_epochs": 10,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "evaluate_generated_text": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "max_length": 15,
    "manual_seed": 4,
}

encoder_type = "roberta"

model = Seq2SeqModel(
    encoder_type,
    "roberta-base",
    "bert-base-cased",
    args=model_args,
    use_cuda=True,
)

model.train_model(train_df)

results = model.eval_model(eval_df)

print(model.predict(["five"]))

model1 = Seq2SeqModel(
    encoder_type,
    encoder_decoder_name="outputs",
    args=model_args,
    use_cuda=True,
)
print(model1.predict(["five"]))

It's the sample code from the simpletransformers library documentation. The dataset size doesn't matter.

https://github.com/ThilinaRajapakse/simpletransformers/blob/master/README.md#encoder-decoder

patrickvonplaten commented 4 years ago

Hey @flozi00, I think #4680 fixes the error.

@chenjunweii - a Bert2Bert model using the EncoderDecoder framework should be the right approach here! You can use one Bert model as an encoder and the other Bert model as a decoder. You will have to fine-tune the EncoderDecoder model a bit, but it should work fine!

You can load the model via:

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased')  # initialize Bert2Bert

and train it on conditional text generation, providing the input_ids as the context, the decoder_input_ids as the text to generate, and lm_labels as the shifted text to generate. Think of decoder_input_ids and lm_labels as your normal inputs for causal text generation and input_ids as the context to condition the model on. I will soon provide a notebook that makes this clearer.
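
For reference, here is a minimal training sketch of what this looks like with a recent transformers version (purely illustrative: lm_labels has since been renamed to labels, and the sentences below are placeholders, so adapt to your installed version):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

# recent versions create decoder_input_ids internally by shifting the labels,
# which requires these two config values to be set
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# input_ids is the context to condition on; labels is the text to generate
src = tokenizer("This is the context sentence.", return_tensors="pt")
tgt = tokenizer("This is the sentence to generate.", return_tensors="pt")

outputs = model(
    input_ids=src.input_ids,
    attention_mask=src.attention_mask,
    labels=tgt.input_ids,
)
outputs.loss.backward()  # fine-tune with your favorite optimizer from here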

Guitaricet commented 4 years ago

Thank you for working on this problem and thank you for 🤗 ! It looks like it is finally possible to write seq2seq models in under 10 lines of code, yay!

But I still have some questions and concerns about the EncoderDecoder.

  1. It is not clear how masking works in the decoder implementation; I spent quite some time getting into it.

Documentation says that "Causal mask will also be used by default", but I did not find how to change it. E.g. what if I am training a model without teacher forcing (just generating words one by one during training), or if I am doing inference?

I would suggest adding one more argument to forward that would make it both clearer when causal masking is used and possible to enable/disable it. What do you think?

  2. It is not clear what the default decoder class is.

It just feels weird to use BERT as a decoder. BERT is a model that is a) non-autoregressive and b) pre-trained without cross-attention modules. It is also unclear at which point the cross-attention modules are created. It would be great, if possible, to add something like a TransformerDecoder model.

patrickvonplaten commented 4 years ago

Hey @Guitaricet :-) ,

First, at the moment only Bert2Bert works with the encoder-decoder framework. Also, if you use Bert as a decoder you will always use a causal mask. At the moment I cannot think of an encoder-decoder in which the decoder does not use a causal mask, so I don't see a reason why one would want to disable it. Can you give me an example where the decoder should not have a causal mask? Do you mean auto-regressive language generation by "generating words one by one"? Auto-regressive language modeling always requires a causal mask...

  2. Currently, only Bert works as a decoder. We might add GPT2 in a couple of weeks. Note that no model has cross-attention layers if it is not already an encoder-decoder model (like Bart or T5), and in that case it does not make sense to use the encoder-decoder wrapper. The model is initialized with random weights for the cross-attention layers, which will have to be fine-tuned. I agree that this should be made clearer in the documentation!
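
As a concrete illustration of the cross-attention point above, here is a hedged sketch (it assumes a recent transformers version where BertLMHeadModel and the add_cross_attention config flag exist; at the time of this thread the exact API differed):

from transformers import BertConfig, BertLMHeadModel

decoder_config = BertConfig.from_pretrained(
    "bert-base-uncased",
    is_decoder=True,           # enables the causal mask
    add_cross_attention=True,  # adds cross-attention layers
)
decoder = BertLMHeadModel.from_pretrained("bert-base-uncased", config=decoder_config)

# the cross-attention parameters are not in the pretrained checkpoint,
# so they are randomly initialized and must be fine-tuned
print([n for n, _ in decoder.named_parameters() if "crossattention" in n][:4])
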
AshOlogn commented 4 years ago

I'm trying to build a Bert2Bert model using EncoderDecoder, but I have a couple quick questions regarding the format of inputs and targets for the BERT decoder.

What exactly is a good way to format the conditional mask for the decoder? For example, if I want to feed the decoder [I, am] and make it output [I, am, happy], how exactly do I mask the input? Do I give the decoder [CLS, I, am, MASK, ...., MASK, SEP] where the number of MASKs is such that the total number of tokens is a fixed length (like 512)? Or do I just input [CLS, I, am, MASK, SEP, PAD, ..., PAD]?

Similarly, what should the decoder's output be? Should the first token (the "output" of CLS) be the token "I"?

Lastly, is there a website or resource that explains the input and output representations of text given to the decoder in Bert2Bert? I don't think the authors of the paper have released their code yet.

Thanks!

patrickvonplaten commented 4 years ago

I will soon release a bert2bert notebook that will show how to do this. You can also take a look at this: https://github.com/huggingface/transformers/issues/4647

Maybe it helps.

Guitaricet commented 4 years ago

Thank you @patrickvonplaten for clarification

  1. I see why not using a causal mask seems weird, and I agree with you. I can think of two reasons not to use a causal mask for generation: 1) inference: you don't have any future to look into, so the mask is not strictly needed (you won't be able to cache the decoder states, though); 2) you can train a model without teacher forcing, i.e. during training you forward your decoder tgt_len times, only using the words that have been predicted by the model instead of feeding the ground truth.

It is very possible that both of these cases are rare, so the library may not need a causal_masking argument, but at least some clarification would help. This is the reason why I found this issue in the first place.

  2. Yes, improving the documentation would help a lot! Still, I would argue that a dedicated Decoder class is a much clearer option if you want to train one from scratch.

I also noticed that the config.is_decoder option is only documented in BertModel and not in the BertConfig class. Adding it would help a lot. (I only found it because I thought it was not documented at all and wanted to check my claim by searching for "is_decoder" in the source code.)

Again, thank you for your work; 🤗 is what the NLP community has needed for quite some time!

UPD: more reasons to use a different attention mask (not for seq2seq though): XLNet-like or ULM-like pre-training.

antoniomrfranco commented 4 years ago

I will soon release a bert2bert notebook that will show how to do this. You can also take a look at this:

#4647

Maybe it helps.

Hi @patrickvonplaten ,

Thanks for the clarification on this topic and for the great work you've been doing on those seq2seq models. Is this notebook you mentioned here already available?

Thanks.

patrickvonplaten commented 4 years ago

Yeah, the code is ready in this PR: https://github.com/huggingface/transformers/tree/more_general_trainer_metric . The script to train an Encoder-Decoder model can be accessed here: https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/bert_encoder_decoder_summary.py

And in order for the script to work, you need to use this Trainer class: https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/trainer.py

I'm currently training the model myself. When the results are decent, I will publish a little notebook.

mingzi151 commented 4 years ago

Hi @patrickvonplaten, thanks for sharing the scripts. However, the second link for training an encoder-decoder model is not found. Could you please upload this script? Thanks.

ghost commented 4 years ago

You

patrickvonplaten commented 4 years ago

Sorry, I deleted the second link. You can see all the necessary code on this model page: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16#bert2bert-summarization-with-%F0%9F%A4%97-encoderdecoder-framework
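
For reference, a hedged usage sketch for that checkpoint (assuming patrickvonplaten/bert2bert-cnn_dailymail-fp16 is still available on the Hub and a recent transformers version is installed):

from transformers import BertTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
tokenizer = BertTokenizer.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")

article = "(some long news article to summarize ...)"
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))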

mingzi151 commented 4 years ago

Thanks for sharing this, Patrick.

AmbiTyga commented 4 years ago

I am trying to implement an encoder-decoder with BART, but I have no idea how to do so, and I need to fine-tune the decoder model, so eventually I need to train my decoder. I am trying to use the EncoderDecoder model in my script, but I don't know how to access the decoder model in order to train it. Instead of using that module, I initialized BartModel as the encoder, whereas for the decoder I used BartForConditionalGeneration. Here's the model I initialized:

encoder = BartModel.from_pretrained('facebook/bart-base')
decoder = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

And here's how I am using it.

for epoch in range(epochs):
        #------------------------training------------------------
        decoder.train()
        losses = 0
        times = 0
        print('\n'+'-'*20 + f'epoch {epoch}' + '-'*20)
        for batch in tqdm(train_dataloader):
            batch = [item.to(device) for item in batch]

            encoder_input, decoder_input, mask_encoder_input, mask_decoder_input = batch

            lhs,hs,att,_,_,_ = encoder(input_ids = encoder_input, attention_mask = mask_encoder_input,output_attentions = True,output_hidden_states = True)
            past = (lhs,hs,att)

            logits,_,_,_= decoder(input_ids = decoder_input, attention_mask = mask_decoder_input, encoder_outputs = past)

            out = logits[:, :-1].contiguous()
            target = decoder_input[:, 1:].contiguous()
            target_mask = mask_decoder_input[:, 1:].contiguous()

            loss = util.sequence_cross_entropy_with_logits(out, target, target_mask, average="token")
            loss.backward()

            losses += loss.item()
            times += 1

            update_count += 1

            if update_count % num_gradients_accumulation == num_gradients_accumulation - 1:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

I am calculating perplexity from the loss, and I am getting a perplexity score of 1000+, which is bad. I would like to know what my model is lacking, and whether I could use the EncoderDecoder module instead.

iliemihai commented 4 years ago

@AmbiTyga from what I know, BART is already an encoder-decoder model, with a BERT-like encoder and a GPT-like decoder. So you are encoding-decoding in the encoder and encoding-decoding in the decoder, which I don't think is a good idea. For the moment EncoderDecoderModel supports only BERT.
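
Since BART already is a complete encoder-decoder, a hedged sketch of fine-tuning it directly (instead of splitting it into two halves as in the snippet above; the text strings are placeholders) could look like this:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

src = tokenizer("source document text", return_tensors="pt")
tgt = tokenizer("target summary text", return_tensors="pt")

outputs = model(
    input_ids=src.input_ids,
    attention_mask=src.attention_mask,
    labels=tgt.input_ids,  # decoder inputs are created internally by shifting the labels
)
outputs.loss.backward()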

AmbiTyga commented 4 years ago

@iliemihai So can you show me how to use BART in cases like the one I coded above?

spookypineapple commented 4 years ago

@patrickvonplaten is Bert the only model that is supported as a decoder? I was hoping to train a universal model, so I wanted to use xlm-roberta (xlmr) as both encoder and decoder; is this possible given the current EncoderDecoder framework? I know Bert has a multilingual checkpoint, but performance-wise an xlm-roberta model should be better. I noticed the notebook https://github.com/huggingface/transformers/blob/16e38940bd7d2345afc82df11706ee9b16aa9d28/model_cards/patrickvonplaten/roberta2roberta-share-cnn_dailymail-fp16/README.md does roberta2roberta; is the same code applicable to xlm-roberta? I tried following the same template with xlmr, but I noticed that the output is the same regardless of the input - the is_decoder flag is properly set to True in the decoder, but the issue persists.

patrickvonplaten commented 4 years ago

Hey @spookypineapple - good question! Here is the PR that adds XLM-Roberta to the EncoderDecoder models: https://github.com/huggingface/transformers/pull/6878

It will not make it into 3.1.0, but it should be available on master in ~1-2 days.
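
Once that PR is on master, initializing an XLM-R encoder-decoder should look just like the Bert2Bert case; a hedged sketch (checkpoint names are the public xlm-roberta-base weights, and the model still needs fine-tuning before generation is meaningful):

from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)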

spookypineapple commented 4 years ago

I'm pulling from master, so I should have at least the necessary code artifacts to get bert2bert to work. However, I'm seeing (for a bert2bert setup using bert-base-multilingual-cased) that the output of the decoder remains unchanged regardless of the input to the encoder; this behavior seems to persist with training... The code I'm using to initialize the EncoderDecoder model is as follows:

import torch
from transformers import (
    MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
    AdamW,
    get_linear_schedule_with_warmup,
    AutoConfig,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    EncoderDecoderModel
)
model_type = 'bert'
model_name = config_name = tokenizer_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_name,
    do_lower_case=False,
    cache_dir=None,
    force_download=False
)
config = AutoConfig.from_pretrained(
    config_name,
    cache_dir=None,
    force_download=False
)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    model_name,  # encoder
    model_name,  # decoder
    from_tf=bool(".ckpt" in model_name),
    config=config,
    cache_dir=None,
)
if model_type in ['bert']:
    tokenizer.bos_token = tokenizer.cls_token
    tokenizer.eos_token = tokenizer.sep_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.tie_weights()
model.decoder.config.use_cache = False

input_str1 = "this is the first example"
input_str2 = "and heres another example for you"
input_encodings1 = tokenizer.encode_plus(input_str1,
                                         padding="max_length",
                                         truncation=True,
                                         max_length=512,
                                         return_tensors="pt")
input_encodings2 = tokenizer.encode_plus(input_str2,
                                         padding="max_length",
                                         truncation=True,
                                         max_length=512,
                                         return_tensors="pt")
gen1 = model.generate(input_ids=input_encodings1.input_ids,
                      attention_mask=input_encodings1.attention_mask,
                      max_length=25,
                      decoder_start_token_id=model.config.decoder_start_token_id
                      )
gen2 = model.generate(input_ids=input_encodings2.input_ids,
                      attention_mask=input_encodings2.attention_mask,
                      max_length=25,
                      decoder_start_token_id=model.config.decoder_start_token_id
                      )
dec1 = [tokenizer.decode(ids, skip_special_tokens=True) for ids in gen1]
dec2 = [tokenizer.decode(ids, skip_special_tokens=True) for ids in gen2]
print(dec1)
print(dec2)

# the outputs are identical even though the inputs are different

patrickvonplaten commented 4 years ago

Hey @spookypineapple,

A couple of things regarding your code:

1) .from_encoder_decoder_pretrained() usually does not need a config. The way you use this function with a config inserted means that you are overwriting the encoder config, which is not recommended when loading an encoder-decoder model from two pretrained "bert-base-multilingual-cased" checkpoints. Also, from_tf will only apply to the encoder; you would additionally have to pass decoder_from_tf.

2) An encoder-decoder model initialized from two pretrained "bert-base-multilingual-cased" checkpoints needs to be fine-tuned before any meaningful results can be seen.

=> You might want to check these model cards of bert2bert, which explain how to fine-tune such an encoder-decoder model: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16

Hope this helps!
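
For reference, a hedged sketch of the initialization described above (no config override; decoder-side special tokens set on the model config before fine-tuning):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# ... fine-tune on a seq2seq dataset before expecting meaningful generations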

spookypineapple commented 4 years ago

Hey @spookypineapple,

A couple of things regarding your code:

  1. .from_encoder_decoder_pretrained() usually does not need a config. The way you use this function with a config inserted means that you are overwriting the encoder config, which is not recommended when loading an encoder-decoder model from two pretrained "bert-base-multilingual-cased" checkpoints. Also, from_tf will only apply to the encoder; you would additionally have to pass decoder_from_tf.
  2. An encoder-decoder model initialized from two pretrained "bert-base-multilingual-cased" checkpoints needs to be fine-tuned before any meaningful results can be seen.

=> You might want to check these model cards of bert2bert, which explain how to fine-tune such an encoder-decoder model: https://huggingface.co/patrickvonplaten/bert2bert-cnn_dailymail-fp16

Hope this helps!

It does help indeed! Thank you @patrickvonplaten

zmf0507 commented 3 years ago

@patrickvonplaten can you please share a tutorial/notebook on training the encoder-decoder model for machine translation?

Atharva-Phatak commented 3 years ago

@patrickvonplaten can you create a notebook on how to use a custom dataset to fine-tune bert2bert models?

marius-sm commented 3 years ago

Hey @Guitaricet :-) ,

First, at the moment only Bert2Bert works with the encoder-decoder framework. Also, if you use Bert as a decoder you will always use a causal mask. At the moment I cannot think of an encoder-decoder in which the decoder does not use a causal mask, so I don't see a reason why one would want to disable it. Can you give me an example where the decoder should not have a causal mask? Do you mean auto-regressive language generation by "generating words one by one"? Auto-regressive language modeling always requires a causal mask...

  2. Currently, only Bert works as a decoder. We might add GPT2 in a couple of weeks. Note that no model has cross-attention layers if it is not already an encoder-decoder model (like Bart or T5), and in that case it does not make sense to use the encoder-decoder wrapper. The model is initialized with random weights for the cross-attention layers, which will have to be fine-tuned. I agree that this should be made clearer in the documentation!

I would like to disable causal masking to use it in DETR, which uses parallel decoding... but this does not seem possible at the moment. In my opinion, an option to disable causal masking in the decoder would be useful.

miguelwon commented 2 years ago

Yeah, the code is ready in this PR: https://github.com/huggingface/transformers/tree/more_general_trainer_metric . The script to train an Encoder-Decoder model can be accessed here: https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/bert_encoder_decoder_summary.py

And in order for the script to work, you need to use this Trainer class: https://github.com/huggingface/transformers/blob/more_general_trainer_metric/src/transformers/trainer.py

I'm currently training the model myself. When the results are decent, I will publish a little notebook.

@patrickvonplaten, none of the links is working. Is it possible to fix them?

patrickvonplaten commented 2 years ago

For BERT2BERT you can just use the EncoderDecoderModel class as shown here: https://huggingface.co/docs/transformers/v4.21.3/en/model_doc/encoder-decoder#transformers.EncoderDecoderModel.forward.example

This example shows how to instantiate a Bert2Bert model, which you can then train on any seq2seq task you want, e.g. summarization: https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization (you just need to slightly adapt the example, or pre-create a Bert2Bert checkpoint and use it as the starting point)
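
A hedged sketch of the "pre-create a Bert2Bert and use it as a checkpoint" route (the output directory name is illustrative, and the summarization script may need small adaptations as noted above):

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

model.save_pretrained("./bert2bert-initial")
tokenizer.save_pretrained("./bert2bert-initial")
# then e.g.:
#   python examples/pytorch/summarization/run_summarization.py \
#       --model_name_or_path ./bert2bert-initial ...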

miguelwon commented 2 years ago

Thanks! Btw, I just submitted an issue and tagged you. There's some problem when using EncoderDecoderModel with the most recent transformers versions.