p208p2002 / Transformer-QG-on-SQuAD

Implement Question Generator with SOTA pre-trained Language Models (RoBERTa, BERT, GPT, BART, T5, etc.)
https://huggingface.co/p208p2002/bart-squad-qg-hl

Question Generation in Recurrent Form #1

Closed · soda-lsq closed this issue 2 years ago

soda-lsq commented 2 years ago

Thanks for your wonderful Transformer-based QG code. Could you give some advice on how to implement recurrent BERT in the paper?

p208p2002 commented 2 years ago

I used to have a recurrent BERT implementation in this project (https://github.com/p208p2002/Transformer-QG-on-SQuAD/commit/9fd761d2c933f22b5de12580965537d16983724e) (https://github.com/p208p2002/Transformer-QG-on-SQuAD/tree/9967fcffccf647831f9e8b824f56f85fd52b952f#masked-lm), but I ran into some unexpected behavior. I suspect the cause was a wrong input, setting, or bug in one of several places: the attention mask, the special tokens ([CLS], [SEP] or [HL]), or a failure to converge (because we ask the model to learn at the token level rather than at the sentence level as in a seq2seq LM or causal LM), so the model never learned to predict the stop token ([SEP] or ?).

The model's predictions look like: Who wrote the books?????????... It still works, but it doesn't meet my expectations. Another, more important issue is that recurrent BERT is very costly to fine-tune: because of the token-level learning, you have to split each sentence into multiple training examples. For example, Who wrote the books? is split into 6 examples (see below):

[CLS] [MASK] -> who
[CLS] who [MASK] -> wrote
[CLS] who wrote [MASK] -> the
[CLS] who wrote the [MASK] -> books
[CLS] who wrote the books [MASK] -> ?
[CLS] who wrote the books ? [MASK] -> [SEP]

but a seq2seq LM or causal LM needs just one:

[CLS] who  wrote the  books     ?
   ↓   ↓    ↓      ↓     ↓      ↓
 who  wrote the  books   ?    [SEP]  

Thanks to masked self-attention, we only need a single update. This post should help with understanding masked self-attention.

As the example shows, recurrent BERT is 6x slower than a seq2seq LM or causal LM, so I finally decided to remove recurrent BERT from this project.
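
For illustration, here is a minimal sketch (not this project's actual preprocessing code, just an assumed Hugging Face BERT tokenizer) of expanding one target question into those token-level examples:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def expand_question(question: str):
    """Expand one target question into token-level MLM examples:
    each example predicts exactly one next token at the trailing [MASK]."""
    token_ids = tokenizer.encode(question, add_special_tokens=False)
    examples = []
    for i in range(len(token_ids) + 1):
        prefix = tokenizer.decode(token_ids[:i])
        # the label is the next token, or [SEP] once the question is complete
        label = tokenizer.decode([token_ids[i]]) if i < len(token_ids) else "[SEP]"
        examples.append((f"[CLS] {prefix} [MASK]", label))
    return examples

for model_input, label in expand_question("who wrote the books?"):
    print(f"{model_input} -> {label}")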

English is not my native language, but I will try my best to answer you. If you still have any questions or I didn't explain something clearly, please feel free to ask.

p208p2002 commented 2 years ago

BERT is composed of Transformer encoder layers only; without a decoder, it is hard to use for text generation and not recommended for that task.

I also found an article about implementing BERT masked language modelling; reading it should help you implement recurrent BERT: https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c
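
As a quick orientation, filling a single [MASK] with Hugging Face transformers looks roughly like the sketch below (not this project's code); recurrent BERT essentially repeats this step, appending each predicted token before a fresh [MASK]:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("Super Bowl 50 was played at [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # [batch_size, seq_len, vocab_size]

# locate the [MASK] position and take the argmax over the vocabulary at that position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))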

soda-lsq commented 2 years ago

Hi Philip,

I really appreciate your kind and thorough reply, thank you so much! Actually, I am a little curious about the meaning of recurrent BERT in the question generation task.

  1. From my point of view, recurrent BERT is a model architecture similar to a recurrent neural network with a BERT cell, so does it need a loop in each training step, like an RNN decoder?
     
    # Example in the paper
    inputs = tokenized("[CLS] The Super Bowl 50 was played at Santa Clara, California. [SEP] Santa Clara, California. [SEP] [MASK]")
    for di in range(target_length):
        bert_last_hidden = Bert(inputs)
        fill_in_token_logits = Softmax(linear(bert_last_hidden))
        fill_in_mask_token = argmax(fill_in_token_logits)  # e.g. we choose the 'Where' token to fill in [MASK]
        # Append the newly selected token before a fresh [MASK] to construct the new inputs.
        inputs = tokenized("[CLS] The Super Bowl 50 was played at Santa Clara, California. [SEP] Santa Clara, California. [SEP] Where [MASK]")
    

However, if we implement recurrent BERT in this way, I think it will face a lot of trouble. Firstly, having a loop in each training step makes fine-tuning BERT a heavy burden. Secondly, BERT is pretrained to fill in randomly placed [MASK]s with the masked language model objective, so I do not know whether it is appropriate to fill in sequential [MASK]s until a [SEP] token is predicted. That is much more like what a GPT-2 model tries to do, isn't it? I am not sure about the difference between recurrent BERT and GPT-2. Do you have any idea?

  2. Secondly, as you mentioned, BERT is actually a Transformer encoder, which is not suitable for generation tasks. But if you switch to a Transformer-decoder model such as GPT-2, I think it would not need to be processed in a recurrent way. So I am actually a little confused about the idea of using a recurrent pretrained Transformer encoder instead of a pretrained Transformer decoder.

Actually, English is not my native language either, but I can totally understand your answer, and I am very grateful for your consideration.

Thanks again and have a nice day~~~

p208p2002 commented 2 years ago

A1

Since BERT does not take its own last hidden output as input, we do not need the loop you mentioned in the training phase; it should be done in data preprocessing (split the data into the form below):

[CLS] [MASK] -> who
[CLS] who [MASK] -> wrote
[CLS] who wrote [MASK] -> the
[CLS] who wrote the [MASK] -> books
[CLS] who wrote the books [MASK] -> ?
[CLS] who wrote the books ? [MASK] -> [SEP]

Consider that form as one batch; to emphasize again, during training BERT does not need to wait for the last hidden output as input.

But the loop is needed in the prediction phase, because the design of recurrent BERT requires the last predicted token as part of the next input.
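
A rough sketch of that prediction-time loop (hypothetical code, assuming a fine-tuned BertForMaskedLM and its tokenizer, greedy decoding only):

import torch

def recurrent_bert_decode(model, tokenizer, context: str, max_len: int = 32) -> str:
    """Greedy decoding for recurrent BERT: re-encode the whole sequence at every step,
    fill the trailing [MASK], append the predicted token, and stop at [SEP]."""
    model.eval()
    generated = []
    for _ in range(max_len):
        text = f"[CLS] {context} {' '.join(generated)} [MASK]"
        inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            logits = model(**inputs).logits  # [1, seq_len, vocab_size]
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        next_id = logits[0, mask_pos[-1]].argmax(dim=-1).item()
        next_token = tokenizer.convert_ids_to_tokens(next_id)
        if next_token == tokenizer.sep_token:
            break
        generated.append(next_token)
    return tokenizer.convert_tokens_to_string(generated)

# e.g. recurrent_bert_decode(model, tokenizer,
#     "The Super Bowl 50 was played at Santa Clara, California. [SEP] Santa Clara, California. [SEP]")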

I also noticed there may be a problem in your loop when decoding from [MASK], concerning how torch.argmax is used; please see below:

import torch

# assume fake_last_hidden is the model output for "how are [MASK]"
# and the model's vocabulary is ['how', 'are', 'you']
vocabs = ['how', 'are', 'you']
fake_last_hidden = torch.LongTensor([[5, 1, 2], [3, 17, 5], [6, 7, 10]]).unsqueeze(0)

# the output (from the model) shape should be [batch_size, seq_len, vocab_size]
print(fake_last_hidden.shape)  # torch.Size([1, 3, 3])

# do argmax over the vocab dimension at every position
# torch.argmax(fake_last_hidden) <- this is wrong: it returns `4`, the index of `17` in the flattened tensor
predict_ids = torch.argmax(fake_last_hidden, dim=-1)  # argmax per position: [5,1,2] -> 0, [3,17,5] -> 1, [6,7,10] -> 2
print(predict_ids)  # tensor([[0, 1, 2]])

# decode using the prediction at the [MASK] position (the last position)
predict_id = predict_ids[0][-1]  # 0 selects the first item in the batch; be careful when batch_size > 1
print(vocabs[predict_id])  # you

A2

It's a very good question! I discussed this with a classmate before. The conclusion is that BERT re-encodes the whole sequence at every step, while GPT does not; that means the previous tokens' logits and hidden states keep changing for BERT. This is reasonable, because BERT is designed to be bidirectional. GPT keeps the previous hidden states fixed so that we can train in parallel, since GPT is never allowed to "cheat" by looking at future information. You could say that recurrent BERT keeps richer information, but with lower efficiency.

This is the operating mechanism I mentioned before (self-attention vs. masked self-attention); I also have a little experiment as proof: https://gist.github.com/p208p2002/537727bd8567f05564dd6b17c5638d83
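
In the same spirit as that gist, here is a small check (my own sketch, not the gist itself) that GPT-2's hidden state for a prefix position stays fixed when later tokens are appended, while BERT's does not:

import torch
from transformers import BertModel, BertTokenizer, GPT2Model, GPT2Tokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt = GPT2Model.from_pretrained("gpt2").eval()

def first_hidden(model, tokenizer, text):
    # hidden state of the first position of the sequence
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0, 0]

# GPT-2 uses masked self-attention: the first position never sees later tokens -> identical
print(torch.allclose(first_hidden(gpt, gpt_tok, "who wrote"),
                     first_hidden(gpt, gpt_tok, "who wrote the books"), atol=1e-5))  # True

# BERT is bidirectional: the same position is re-encoded with the new tokens -> different
print(torch.allclose(first_hidden(bert, bert_tok, "who wrote"),
                     first_hidden(bert, bert_tok, "who wrote the books"), atol=1e-5))  # False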

soda-lsq commented 2 years ago

Hi Philip,

Thanks again for your detailed explanation. Sorry for the late reply for trying to figure out the problems.

For A1, given that the loop is handled in the data pre-processing stage, I now understand how it is processed. In a different generation task I also used to break one sentence down into several segments for training, as this project did, and I found the same problem that the generated sentence is hard to stop in the decoding stage. I find your explanation that the phenomenon comes from the model being trained at the token level rather than the sentence level very reasonable.

For A2, I see that recurrent BERT can contain richer information due to the bidirectional re-encoding, which keeps more information but has lower efficiency compared with GPT. And thanks for your demo.

I briefly browsed the papers that cite this work; it seems that recurrent pretrained models for token-level generation have not had a huge impact, since not much later work follows the architecture. However, recurrent pretrained models at the sentence level have more applications. In some cases the input exceeds 512 tokens, and to encode the long sequence a recurrent layer is added on top of the pretrained layers to let the pretrained information flow across segments.

For now, I have understood everything. Thanks so much. This is the first time my question has been answered in such a thorough and detailed way; I really appreciate your help.