google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

a question about `num_attention_heads` #153

Closed · beamind closed this issue 4 years ago

beamind commented 4 years ago

According to Section 3.1 of the paper, for the xxlarge model num_attention_heads = H/64 = 4096/64 = 64. But according to the albert_config.json file from albert_xxlarge_zh.tar.gz, num_attention_heads = 16. Which one is correct? Thanks!
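
For reference, a minimal sketch of the head arithmetic in question, using only the values quoted in this thread:

```python
# Sketch of the head arithmetic discussed above (values from this thread).
hidden_size = 4096                    # H for the xxlarge model
paper_heads = hidden_size // 64       # paper's H/64 rule -> 64 heads
config_heads = 16                     # value in the zh albert_config.json

print(paper_heads, hidden_size // paper_heads)    # 64 heads, each of size 64
print(config_heads, hidden_size // config_heads)  # 16 heads, each of size 256
```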

akakakakakaa commented 4 years ago

4096 is the intermediate size, not the hidden size.

beamind commented 4 years ago

> 4096 is the intermediate size, not the hidden size.

For xxlarge, "hidden_size": 4096 and "intermediate_size": 16384. This is from the albert_config.json file in albert_xxlarge_zh.tar.gz.
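
A minimal way to check these fields yourself, assuming the archive has been extracted (the path below is an assumption; adjust it to your setup):

```python
import json

# Hypothetical path; adjust to wherever albert_xxlarge_zh.tar.gz was extracted.
with open("albert_xxlarge_zh/albert_config.json") as f:
    cfg = json.load(f)

for key in ("hidden_size", "intermediate_size", "num_attention_heads"):
    print(key, cfg[key])
# Per this thread: 4096, 16384, 16
```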

akakakakakaa commented 4 years ago

Sorry, I checked again.

When I downloaded the v2 version, num_attention_heads is 64.

Danny-Google commented 4 years ago

Yes, we later found that 16 heads give better performance and switched to 16 heads in the zh version. Our new paper on talking-heads attention should fix the problem that more heads give worse performance (https://arxiv.org/abs/2003.02436).
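
For readers unfamiliar with the reference, a minimal NumPy sketch of the talking-heads idea under simplifying assumptions (same number of heads before and after each projection, no masking, no output projection); the shapes and names below are illustrative, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def talking_heads_attention(q, k, v, P_logits, P_weights):
    """Minimal sketch of talking-heads attention (arXiv:2003.02436).

    q, k, v: [heads, seq, depth] per-head queries/keys/values.
    P_logits, P_weights: [heads, heads] learned projections that mix
    information across heads before and after the softmax.
    """
    depth = q.shape[-1]
    logits = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(depth)
    logits = np.einsum("hqk,hg->gqk", logits, P_logits)     # mix heads pre-softmax
    weights = softmax(logits, axis=-1)
    weights = np.einsum("hqk,hg->gqk", weights, P_weights)  # mix heads post-softmax
    return np.einsum("hqk,hkd->hqd", weights, v)

# Toy shapes: 16 heads, sequence length 8, head size 256 (as in the zh config).
h, s, d = 16, 8, 256
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((h, s, d)) for _ in range(3))
P_logits = rng.standard_normal((h, h))
P_weights = rng.standard_normal((h, h))
print(talking_heads_attention(q, k, v, P_logits, P_weights).shape)  # (16, 8, 256)
```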