huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Encoder/Decoder generation #4171

Closed anishthite closed 4 years ago

anishthite commented 4 years ago

Hello! I tried to train a Bert2Bert model for QA generation, however, when I try the generate function it returns gibberish. I also tried the example code below, and that also generated gibberish (the output is "[PAD] leon leon leon leon leonieieieieie shall shall shall shall shall shall shall shall shall"). Is the generate function supposed to work for EncoderDecoder models, and what am I doing wrong?

from transformers import EncoderDecoderModel, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert

input_ids = tokenizer.encode("example question text", return_tensors='pt')  # placeholder input; any tokenized text works here
generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)

patil-suraj commented 4 years ago

Are you using this exact line?

model = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert

If yes, then please use the paths to your saved model instead. A few other things to try: verify your data pipeline, and try using beam search or sampling in generate.
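For example, something like this (a sketch only; the num_beams, top_k, top_p and max_length values are just illustrative, and input_ids and tokenizer come from the snippet above):

# beam search
generated = model.generate(
    input_ids,
    decoder_start_token_id=model.config.decoder.pad_token_id,
    num_beams=4,
    max_length=64,
    early_stopping=True,
)

# or top-k / nucleus sampling
generated = model.generate(
    input_ids,
    decoder_start_token_id=model.config.decoder.pad_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    max_length=64,
)

print(tokenizer.decode(generated[0], skip_special_tokens=True))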

anishthite commented 4 years ago

Thanks! I am using that exact line. I saved my trained model using save_pretrained() and it saved everything as one file. How would I separate this, or should I just retrain and re-save the encoder and decoder separately? Also, does the untrained model not work due to the untrained cross attention layer?

patil-suraj commented 4 years ago

If you saved your model using .save_pretrained, then you can load it with .from_pretrained just like any other HF model. Just pass the path of your saved model; you won't need .from_encoder_decoder_pretrained.
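Something along these lines (a minimal sketch; the path is a placeholder for whatever directory you passed to .save_pretrained):

from transformers import EncoderDecoderModel

# load the fine-tuned encoder-decoder model from its save directory
model = EncoderDecoderModel.from_pretrained("path/to/your/saved/model")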

patrickvonplaten commented 4 years ago

Hi @anishthite,

How did you train your Bert2Bert model? Can you post the code you used to train it here? Don't worry if it's a very long code snippet :-)

anishthite commented 4 years ago

Hello! I managed to figure out the issue. I retrained and saved the encoder and decoder in their own folders, and I was then able to load it in as @patil-suraj suggested. I guess earlier it was loading the untrained model. Would it be helpful to redefine save_pretrained() for EncoderDecoder models so that it automatically splits the model into an encoder and a decoder folder? I can submit a PR if you want.
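For reference, loading the two folders back into a single model would look roughly like this (a sketch only, assuming from_encoder_decoder_pretrained with the folder names used in the training script):

from transformers import EncoderDecoderModel

# rebuild the Bert2Bert model from the separately saved encoder and decoder folders
model = EncoderDecoderModel.from_encoder_decoder_pretrained("combinerslargeencoder", "combinerslargedecoder")

The training code: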

    # assumes model, tokenizer, args, device, LEARNING_RATE, WARMUP_STEPS, QADataset and the usual imports (torch, os, DataLoader, AdamW, get_linear_schedule_with_warmup) are defined earlier in the script
    dataset = QADataset(dataset=args.traindataset, block_size=args.maxseqlen)
    qa_loader = DataLoader(dataset, batch_size=args.batch, shuffle=True)
    model.train()
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    t_total = len(qa_loader) // args.gradient_acums * args.epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps = t_total)
    proc_seq_count = 0
    sum_loss = 0.0
    batch_count = 0
    models_folder = "combinerslargeencoder"
    models_folder2 = "combinerslargedecoder"
    if not os.path.exists(models_folder):
        os.mkdir(models_folder)
    if not os.path.exists(models_folder2):
        os.mkdir(models_folder2)
    for epoch in range(args.epochs):

        print(f"EPOCH {epoch} started" + '=' * 30)

        for idx,qa in enumerate(qa_loader):
            print(str(idx) + ' ' + str(len(qa_loader)))
            inputs, labels = (qa[0], qa[1])
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(input_ids=inputs, decoder_input_ids=labels, lm_labels=labels)
            loss, logits = outputs[:2]
            loss = loss / args.gradient_acums
            loss.backward()
            sum_loss = sum_loss + loss.detach().data

            #proc_seq_count = proc_seq_count + 1
            #if proc_seq_count == args.gradient_acums:
            #    proc_seq_count = 0    
            batch_count += 1
            if (idx + 1) % args.gradient_acums == 0:  # gradient accumulation: only step every args.gradient_acums batches
                optimizer.step()
                scheduler.step() 
                optimizer.zero_grad()
                model.zero_grad()

            if batch_count == 100:
                print(f"sum loss {sum_loss}")
                batch_count = 0
                sum_loss = 0.0

        # Store the model after each epoch to compare the performance of them
        torch.save(model.state_dict(), os.path.join(models_folder, f"combined_mymodel_{args.maxseqlen}{epoch}{args.gradient_acums}.pt"))
        model.save_pretrained(models_folder)
        model.encoder.save_pretrained(models_folder)
        model.decoder.save_pretrained(models_folder2)
        evaluate(args, model, tokenizer)

patrickvonplaten commented 4 years ago

Why do you save the encoder and decoder models separately?

        model.encoder.save_pretrained(models_folder)
        model.decoder.save_pretrained(models_folder2)

This line:

        model.save_pretrained(models_folder)

should be enough.

We moved away from saving the model into two separate folders; see https://github.com/huggingface/transformers/pull/3383. The docs at https://huggingface.co/transformers/model_doc/encoderdecoder.html might also be useful.
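With the single-folder approach, the save/load round trip is then just something like this (a minimal sketch; the directory name is only a placeholder):

from transformers import EncoderDecoderModel

# save encoder, decoder and config together into one directory ...
model.save_pretrained("bert2bert-qa")

# ... and later reload the whole model from that directory in one call
model = EncoderDecoderModel.from_pretrained("bert2bert-qa")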