google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Format of corpus #43

Closed · mahnerak closed this issue 4 years ago

mahnerak commented 4 years ago

According to the paper, ELECTRA does not involve the NSP (next sentence prediction) task. In that case, do we need sentence segmentation? Does build_pretraining_dataset.py treat each line as a separate sentence, or can we just feed it raw text (with empty lines as document separators)?

clarkkev commented 4 years ago

Sentence segmentation isn't strictly needed to pre-train ELECTRA, but passing segment ids and [SEP] tokens to the model during pre-training is probably helpful for fine-tuning on downstream tasks like NLI and QA, where the input consists of two distinct segments.
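
For concreteness, here's a minimal sketch of how a two-segment input is usually packed for BERT-style models; the tokens below are made up for illustration, a real pipeline would come from the repo's WordPiece tokenizer:

```python
# Minimal sketch of a BERT-style two-segment input, e.g. for NLI or QA fine-tuning.
# The tokens are hand-written for illustration only.

segment_a = ["the", "cat", "sat"]    # e.g. premise / question
segment_b = ["it", "was", "tired"]   # e.g. hypothesis / passage

tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
# Segment ids: 0 for [CLS] + segment A + its [SEP], 1 for segment B + the final [SEP].
segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)

print(list(zip(tokens, segment_ids)))
```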

build_pretraining_dataset.py was built for data where each line contains one sentence. However, I'd guess that pre-training would still work fine if you split up your documents into fixed-length segments of 30 words or so and built a dataset with each line containing the next 30 words from the document.
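
In case it helps, here's a rough sketch of that chunking approach. It assumes a plain-text corpus where blank lines separate documents, and it writes ~30-word chunks, one per line, so the output can be fed to build_pretraining_dataset.py as if each chunk were a sentence. The file names and the 30-word chunk size are just placeholders:

```python
# Rough sketch: turn raw documents into fixed-length ~30-word "sentences",
# one chunk per line. Input/output paths and the chunk size are placeholders.

CHUNK_SIZE = 30

def chunk_document(text, chunk_size=CHUNK_SIZE):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])

with open("raw_corpus.txt") as fin, open("pretrain_corpus.txt", "w") as fout:
    document = []
    for line in fin:
        line = line.strip()
        if line:
            document.append(line)
        elif document:
            # Blank line marks the end of a document in the raw corpus.
            for chunk in chunk_document(" ".join(document)):
                fout.write(chunk + "\n")
            # Keep a blank line between documents in the output as well
            # (assuming the script uses blank lines as document boundaries).
            fout.write("\n")
            document = []
    if document:  # flush the last document
        for chunk in chunk_document(" ".join(document)):
            fout.write(chunk + "\n")
```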

mahnerak commented 4 years ago

Thank you, @clarkkev! One more question - what was the approach in the original paper (and the released pretrained models)? Did you use segmented sentences (like the original BERT) or the fixed-length segments you suggested in your comment?

clarkkev commented 4 years ago

We used segmented sentences like the original BERT.

mahnerak commented 4 years ago

Thanks!