richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Question about ELECTRADataProcessor or ExampleBuilder #11

Closed miyamonz closed 3 years ago

miyamonz commented 3 years ago

First, thanks for sharing this repo! It has been very helpful for understanding ELECTRA pretraining.

I have a question about ELECTRADataProcessor. https://github.com/richarddwang/electra_pytorch/blob/80d1790b6675720832c5db5f22b7e036f68208b8/_utils/utils.py#L101

I read this code and found that it corresponds to this file in the original implementation: https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py#L34

I can understand what this part does: it is a preprocessing step that randomly splits sentences into two segments, merges them into one example, and so on. But I can't understand why it does this. I skimmed the ELECTRA paper and couldn't find an explanation. In my understanding, ELECTRA just needs many sentences, like BERT. Why are two segments needed, and why are they split randomly at preprocessing time?
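
For reference, this is roughly how I read that preprocessing step (a simplified sketch of my own; the function name, the length bookkeeping, and the assumption that sentences are already tokenized are mine, not the actual code):

def build_example(tokenized_sentences, target_length=128):
    # Pack consecutive sentences into one example of roughly `target_length`
    # tokens, split into two segments A and B.
    budget = target_length - 3          # room for [CLS] and two [SEP]
    first_target = budget // 2          # first segment gets about half the budget

    first_segment, second_segment = [], []
    for sent in tokenized_sentences:
        # Whole sentences fill segment A until it is about half full,
        # then the remaining sentences go to segment B.
        if len(first_segment) < first_target and not second_segment:
            first_segment += sent
        else:
            second_segment += sent

    first_segment = first_segment[:budget]
    second_segment = second_segment[:budget - len(first_segment)]

    example = ["[CLS]"] + first_segment + ["[SEP]"]
    if second_segment:
        example += second_segment + ["[SEP]"]
    return example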

I already asked about this here, but there has been no response: https://github.com/google-research/electra/issues/114

I would be happy if you could reply when you know something and have the time.

richarddwang commented 3 years ago

Hi @miyamonz,

According to my personal understanding, it is because there are two kinds of input patterns during finetuning. For example:

CoLA task: (check grammatical correctness, one sentence per sample)
[CLS] This is an example . [SEP]

QQP task: (check whether two questions are duplicates, two sentences per sample)
[CLS] What are must eat cuisines around Nagoya University ? [SEP] What are recommended cuisines around Nagoya University ? [SEP]

So, to let the pretrained model get used to both patterns and minimize the gap between pretraining and finetuning, ELECTRA randomly creates single-segment and two-segment examples during preprocessing.
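
Roughly, the idea looks like this (a simplified sketch, not the actual ELECTRADataProcessor / ExampleBuilder code; the 10% probability and the exact cut point are just how I remember the original build_pretraining_dataset.py, so treat them as assumptions):

import random

def choose_segmentation(tokens, max_length=128):
    # Occasionally build a single-segment example, [CLS] A [SEP],
    # matching one-sentence tasks like CoLA.
    if random.random() < 0.1:
        return ["[CLS]"] + tokens[:max_length - 2] + ["[SEP]"]

    # Otherwise build a two-segment example, [CLS] A [SEP] B [SEP],
    # matching two-sentence tasks like QQP.
    cut = min(len(tokens), (max_length - 3) // 2)
    first, second = tokens[:cut], tokens[cut:max_length - 3]
    example = ["[CLS]"] + first + ["[SEP]"]
    if second:
        example += second + ["[SEP]"]
    return example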

Please tag me if you have any questions.

miyamonz commented 3 years ago

Thanks! That's very helpful for me.