Open echoht opened 5 years ago
GPT uses uni-directional self-attention, so I think that setting for similarity learning is reasonable. In contrast, BERT uses a bidirectional self-attention mechanism, which is natural for the architecture in pic (a).
Thank you for the reply. But BERT's pretraining also considers the relationship between sentence A and sentence B, so in theory, when fine-tuning on STS-B, the sentence order may be considered in some hidden way, which is not what we want.
pic (a) is okay, since its experimental result is better than GPT's in the BERT paper, but if we fine-tune BERT the same way as pic (b), can we get a better result?
Maybe you could randomly shuffle some Sent_A and Sent_B, that should remove any hidden discrepancies in the data.
Essentially, data-augmentation
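As a rough sketch of that augmentation idea (names here are made up for illustration, not from any library): randomly swap `Sent_A` and `Sent_B` in each training pair, leaving the label alone, since STS-B similarity is symmetric.

```python
import random

def symmetrize_pairs(pairs, swap_prob=0.5, seed=0):
    """Randomly swap Sent_A and Sent_B in each (sent_a, sent_b, label)
    triple so the model cannot exploit a fixed sentence order.
    Labels are unchanged because similarity is symmetric."""
    rng = random.Random(seed)
    out = []
    for sent_a, sent_b, label in pairs:
        if rng.random() < swap_prob:
            sent_a, sent_b = sent_b, sent_a
        out.append((sent_a, sent_b, label))
    return out

pairs = [("a cat sits", "a cat is sitting", 4.8),
         ("a man runs", "a dog barks", 0.5)]
augmented = symmetrize_pairs(pairs)
```

Doubling the data with both orders instead of swapping in place would also work, at the cost of longer epochs.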
I fed text_a into one BERT and text_b into another BERT (reusing the same weights), but I got a worse result. I think BERT's multi-head attention attends over both text_a and text_b jointly, so if we split them into a Siamese model, it loses information when computing the attention.
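For concreteness, the Siamese setup described above looks roughly like this. `encode` is a hypothetical stand-in for one pooled BERT forward pass; it is not a real library function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def siamese_score(encode, text_a, text_b):
    """Encode each sentence independently with the *same* encoder
    (weight sharing, i.e. 'bert reuse') and compare pooled vectors.
    The two texts never attend to each other here, which is exactly
    the cross-attention information loss described above."""
    return cosine(encode(text_a), encode(text_b))
```

Because each text is encoded alone, the score is symmetric in the inputs by construction, which removes the order problem but also removes token-level interaction between the sentences.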
pic (b) does not feed text_a and text_b into separate BERTs; it feeds text_a[SEP]text_b into one BERT and text_b[SEP]text_a into another.
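So pic (b) is still a cross-encoder; it just runs both concatenation orders and combines them. A minimal sketch, where `score_pair` is a hypothetical stand-in for one BERT forward pass over `[CLS] x [SEP] y [SEP]`:

```python
def order_invariant_score(score_pair, text_a, text_b):
    """Run the cross-encoder on both concatenation orders
    (text_a [SEP] text_b and text_b [SEP] text_a) and average,
    so the final score cannot depend on which sentence comes first."""
    return 0.5 * (score_pair(text_a, text_b) + score_pair(text_b, text_a))
```

This keeps the token-level cross-attention that the Siamese split loses, at the cost of two forward passes per pair.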
In the BERT paper, fine-tuning on STS-B is treated as a classification task, and the model structure is as follows, which by default puts sentence1 before sentence2:
But in the OpenAI GPT paper, fine-tuning considers both the sen1_sen2 order and the sen2_sen1 order. I think this is more reasonable, but why doesn't the BERT paper mention this?