Open echoht opened 5 years ago
GPT uses uni-directional self-attention, so I think that setting for similarity learning is reasonable. In contrast, BERT uses a bidirectional self-attention mechanism, which is natural for the architecture in pic (a).
Thank you for the reply. But BERT's pretraining also considers the relationship between sentence A and sentence B, so in theory, when fine-tuning on STS-B, the sentence order may be considered in some hidden way, which is not what we want.
pic (a) is okay, since its experimental result is better than GPT's in the BERT paper, but if we fine-tune BERT the same way as pic (b), can we get a better result?
Maybe you could randomly shuffle some Sent_A and Sent_B, that should remove any hidden discrepancies in the data.
Essentially, data-augmentation
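As a rough sketch of that augmentation idea (names here are made up for illustration, not from any library): randomly swap `Sent_A` and `Sent_B` in each training pair, leaving the label alone, since STS-B similarity is symmetric.

```python
import random

def symmetrize_pairs(pairs, swap_prob=0.5, seed=0):
    """Randomly swap Sent_A and Sent_B in each (sent_a, sent_b, label)
    triple so the model cannot exploit a fixed sentence order.
    Labels are unchanged because similarity is symmetric."""
    rng = random.Random(seed)
    out = []
    for sent_a, sent_b, label in pairs:
        if rng.random() < swap_prob:
            sent_a, sent_b = sent_b, sent_a
        out.append((sent_a, sent_b, label))
    return out

pairs = [("a cat sits", "a cat is sitting", 4.8),
         ("a man runs", "a dog barks", 0.5)]
augmented = symmetrize_pairs(pairs)
```

Doubling the data with both orders instead of swapping in place would also work, at the cost of longer epochs.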
I fed text_a into one BERT and text_b into another BERT (reusing the same weights), but I got a worse result. I think BERT's multi-head attention attends over both text_a and text_b jointly, so if we split them into a Siamese model, it loses information when computing the attention.
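For concreteness, the Siamese setup described above looks roughly like this. `encode` is a hypothetical stand-in for one pooled BERT forward pass; it is not a real library function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def siamese_score(encode, text_a, text_b):
    """Encode each sentence independently with the *same* encoder
    (weight sharing, i.e. 'bert reuse') and compare pooled vectors.
    The two texts never attend to each other here, which is exactly
    the cross-attention information loss described above."""
    return cosine(encode(text_a), encode(text_b))
```

Because each text is encoded alone, the score is symmetric in the inputs by construction, which removes the order problem but also removes token-level interaction between the sentences.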
pic (b) does not feed text_a and text_b into separate BERTs; it feeds text_a[SEP]text_b into one BERT and text_b[SEP]text_a into another.
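So pic (b) is still a cross-encoder; it just runs both concatenation orders and combines them. A minimal sketch, where `score_pair` is a hypothetical stand-in for one BERT forward pass over `[CLS] x [SEP] y [SEP]`:

```python
def order_invariant_score(score_pair, text_a, text_b):
    """Run the cross-encoder on both concatenation orders
    (text_a [SEP] text_b and text_b [SEP] text_a) and average,
    so the final score cannot depend on which sentence comes first."""
    return 0.5 * (score_pair(text_a, text_b) + score_pair(text_b, text_a))
```

This keeps the token-level cross-attention that the Siamese split loses, at the cost of two forward passes per pair.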
In the BERT paper, fine-tuning on STS-B is treated as a classification task, and the model structure is as follows, which by default puts sentence1 before sentence2:
But in the OpenAI GPT paper, fine-tuning considers both the sen1_sen2 order and the sen2_sen1 order. I think this is more reasonable, but why doesn't the BERT paper mention this?