google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Apache License 2.0

mask prob in large model #71

Closed · santaonchair closed this issue 4 years ago

santaonchair commented 4 years ago

Hi :)

The masking rate (mask_prob) is 15% in the base and small models, but 25% in the large model. This makes me curious: would a higher masking rate increase model performance?
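For context, the masking rate is an ordinary pretraining hyperparameter in this repo. The sketch below is a paraphrase rather than the repo's exact code: it assumes the field names `mask_prob` and `model_size` from `configure_pretraining.py`, and that `--hparams` overrides are applied as a simple dict update.

```python
# Minimal sketch of how the masking rate could be wired up
# (paraphrased; the real PretrainingConfig has many more fields).
class PretrainingConfig:
    def __init__(self, model_size="base", **kwargs):
        self.model_size = model_size
        self.mask_prob = 0.15      # default used for the small/base models
        if self.model_size == "large":
            self.mask_prob = 0.25  # raised for the large model
        # --hparams '{"mask_prob": 0.2}' style overrides would land here
        self.__dict__.update(kwargs)

config = PretrainingConfig(model_size="base", mask_prob=0.25)
print(config.mask_prob)  # 0.25: a base model with the large-model mask rate
```

If the config works this way, trying other rates on a base/small model should just be a matter of passing a different `mask_prob` when launching `run_pretraining.py`.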

clarkkev commented 4 years ago

We used 15% for most experiments to keep things consistent with BERT. However, with 15% masking we ended up with only about 7% of tokens being "replaced" rather than "original", because ELECTRA-Large's generator gets >50% accuracy (a token the generator reconstructs correctly counts as "original"). We thought this level of imbalance might hurt performance, so we increased the mask percent for the large model, which helped results a bit. That said, we haven't tried other mask percents with the base/small models.
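A quick back-of-the-envelope check of those numbers (the 55% generator accuracy below is an illustrative assumption standing in for the ">50%" figure above): a masked position only becomes a "replaced" example when the generator samples a wrong token, so the replaced fraction is roughly mask_prob * (1 - generator_accuracy).

```python
# Rough expected fraction of all tokens that end up "replaced".
# A correctly reconstructed token is labeled "original", not "replaced".
def replaced_fraction(mask_prob: float, generator_accuracy: float) -> float:
    return mask_prob * (1.0 - generator_accuracy)

# 15% masking with ~55% generator accuracy (assumed value):
print(replaced_fraction(0.15, 0.55))  # ~0.07 -> the ~7% quoted above
# 25% masking brings more "replaced" examples back:
print(replaced_fraction(0.25, 0.55))  # ~0.11
```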