santaonchair closed this issue 4 years ago
We used 15% for most experiments to keep things consistent with BERT. However, with 15% masking we were ending up with only about 7% of tokens being "replaced" rather than "original" because ELECTRA-Large's generator gets >50% accuracy. We thought this level of imbalance might hurt performance, so we increased the mask percent for the large model, which helped results a bit. However, we haven't tried other mask percents with base/small models.
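The arithmetic behind the "about 7%" figure can be sketched roughly as follows. This is a back-of-the-envelope illustration, not code from the ELECTRA repository: the helper name `replaced_fraction` is hypothetical, the 50% generator accuracy is a round illustrative value (the comment above only says ">50%"), and it assumes that a masked position where the generator produces the correct token counts as "original".

```python
def replaced_fraction(mask_prob: float, generator_accuracy: float) -> float:
    """Expected fraction of all tokens that end up "replaced".

    Assumption: a masked position only counts as "replaced" when the
    generator's output differs from the original token, so the fraction
    is mask_prob times the generator's error rate.
    """
    return mask_prob * (1.0 - generator_accuracy)


if __name__ == "__main__":
    # 0.5 is an illustrative accuracy; a higher accuracy gives a smaller fraction.
    for mask_prob in (0.15, 0.25):
        frac = replaced_fraction(mask_prob, generator_accuracy=0.5)
        print(f"mask_prob={mask_prob:.2f}: about {frac:.1%} of tokens replaced")
```

Under these assumptions, 15% masking gives 7.5% replaced tokens and 25% masking gives 12.5%. A generator somewhat above 50% accuracy pushes the first figure down toward the roughly 7% mentioned above.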
Hi :)
The masking rate (mask_prob) is 15% in the base and small models, but 25% in the large model, which makes me curious: would a higher masking rate improve model performance?