Hi,
I have started working on the BERT model. Does anyone know what BERT's pre-training accuracy (not fine-tuned) was with the 100-0-0 masking approach versus the 80-10-10 approach (80% `[MASK]`, 10% random token, 10% unchanged)? I could not find it anywhere. I understand why the 80-10-10 approach is used, but did the authors run any experiments to arrive at it?
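To be concrete about the two schemes I am comparing, here is a minimal sketch of the masking step as I understand it. The function name `mask_tokens`, the `split` parameter, and the toy vocabulary are my own assumptions for illustration, not code from the BERT repository.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, split=(0.8, 0.1, 0.1), seed=0):
    """Hypothetical helper: corrupt `tokens` for masked-LM pre-training.

    split = (p_mask, p_random, p_keep); (1.0, 0.0, 0.0) is the 100-0-0 scheme.
    Returns (corrupted, labels), where labels[i] is the original token at
    positions selected for prediction and None elsewhere.
    """
    rng = random.Random(seed)
    p_mask, p_random, _p_keep = split
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < p_mask:
            corrupted[i] = MASK  # replace with the mask token
        elif r < p_mask + p_random:
            corrupted[i] = rng.choice(vocab)  # replace with a random token
        # otherwise leave the original token in place
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

# 80-10-10 versus 100-0-0 on the same input
print(mask_tokens(tokens, vocab, split=(0.8, 0.1, 0.1)))
print(mask_tokens(tokens, vocab, split=(1.0, 0.0, 0.0)))
```

What I am looking for is the masked-LM accuracy at the end of pre-training when only the `split` setting differs between runs.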