4096 is the intermediate size, not the hidden size.
For xxlarge, "hidden_size": 4096 and "intermediate_size": 16384. This is from the albert_config.json file in albert_xxlarge_zh.tar.gz.
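For reference, the relevant fields of albert_config.json from albert_xxlarge_zh.tar.gz, showing only the fields discussed in this thread (all other fields omitted):

```json
{
  "hidden_size": 4096,
  "intermediate_size": 16384,
  "num_attention_heads": 16
}
```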
Sorry, you're right; I checked.
When I downloaded the v2 version, num_attention_heads was 64.
Yes, we later found that 16 heads give better performance and switched to 16 heads in the zh version. Our new paper on talking-heads attention should fix the problem that more heads give worse performance (https://arxiv.org/abs/2003.02436).
According to part 3.1 of the paper, for the xxlarge model, `num_attention_heads` = H/64 = 4096/64 = 64. But according to the `albert_config.json` file from `albert_xxlarge_zh.tar.gz`, `num_attention_heads` = 16. So which is correct? Thanks!
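A minimal way to verify the shipped value, assuming `albert_config.json` has been extracted from the tarball into the working directory:

```python
import json

# Load the config file shipped in albert_xxlarge_zh.tar.gz.
with open("albert_config.json") as f:
    config = json.load(f)

# Prints 16 for albert_xxlarge_zh, not the 64 implied by H/64 = 4096/64.
print(config["num_attention_heads"])
```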