google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Format of corpus #43

Closed · mahnerak closed this issue 4 years ago

mahnerak commented 4 years ago

According to the paper, ELECTRA does not involve the NSP (next sentence prediction) task. In that case, do we need sentence segmentation? Does build_pretraining_dataset.py treat each line as a separate sentence, or can we just feed it raw text (with empty lines as document separators)?

clarkkev commented 4 years ago

Sentence segmentation isn't strictly needed to pre-train ELECTRA, but passing segment ids and [SEP] tokens to the model during pre-training is probably helpful for fine-tuning on downstream tasks like NLI and QA, where the input consists of two distinct segments.
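
For concreteness, here's a minimal sketch of how a two-segment input is usually packed for BERT-style models; the tokens below are made up for illustration, a real pipeline would come from the repo's WordPiece tokenizer:

```python
# Minimal sketch of a BERT-style two-segment input, e.g. for NLI or QA fine-tuning.
# The tokens are hand-written for illustration only.

segment_a = ["the", "cat", "sat"]    # e.g. premise / question
segment_b = ["it", "was", "tired"]   # e.g. hypothesis / passage

tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
# Segment ids: 0 for [CLS] + segment A + its [SEP], 1 for segment B + the final [SEP].
segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)

print(list(zip(tokens, segment_ids)))
```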

build_pretraining_dataset.py was built for data where each line contains one sentence. However, I'd guess that pre-training would still work fine if you split up your documents into fixed-length segments of 30 words or so and built a dataset with each line containing the next 30 words from the document.
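
In case it helps, here's a rough sketch of that chunking approach. It assumes a plain-text corpus where blank lines separate documents, and it writes ~30-word chunks, one per line, so the output can be fed to build_pretraining_dataset.py as if each chunk were a sentence. The file names and the 30-word chunk size are just placeholders:

```python
# Rough sketch: turn raw documents into fixed-length ~30-word "sentences",
# one chunk per line. Input/output paths and the chunk size are placeholders.

CHUNK_SIZE = 30

def chunk_document(text, chunk_size=CHUNK_SIZE):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])

with open("raw_corpus.txt") as fin, open("pretrain_corpus.txt", "w") as fout:
    document = []
    for line in fin:
        line = line.strip()
        if line:
            document.append(line)
        elif document:
            # Blank line marks the end of a document in the raw corpus.
            for chunk in chunk_document(" ".join(document)):
                fout.write(chunk + "\n")
            # Keep a blank line between documents in the output as well
            # (assuming the script uses blank lines as document boundaries).
            fout.write("\n")
            document = []
    if document:  # flush the last document
        for chunk in chunk_document(" ".join(document)):
            fout.write(chunk + "\n")
```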

mahnerak commented 4 years ago

Thank you, @clarkkev! One more question - what was the approach in the original paper (and the released pretrained models)? Did you use segmented sentences (like the original BERT) or the fixed-length segments you suggested in your comment?

clarkkev commented 4 years ago

We used segmented sentences like the original BERT.

mahnerak commented 4 years ago

Thanks!