google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

[question] Cross-layer parameter sharing #200

Open pglushkov opened 4 years ago

pglushkov commented 4 years ago

Hi all, this is not a concrete issue I found in the code, but more of a code-understanding question. If you find this type of question inappropriate here, please accept my apologies in advance.

The question is about parameter sharing. The original ALBERT paper states that all hidden FFN and attention layers share their parameters with each other. At inference time, when the network is initialized, I kinda see this happening in run_pretraining.py, where we work with the num_of_initialize_group variable and presumably assign the same weights to all layers within the same group. However, during training, when we run modeling.py::transformer_model(), we distinctly create new attention/dense layers in the loop for each of [1 ... num_hidden_layers]. Won't that force them to be optimized separately during training? If so, where does the 'parameter sharing' part come into the picture? Thank you for your answers!
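For concreteness, here is a toy sketch of what I would expect parameter sharing to look like in a TF1-style loop (illustrative only, not the repo's code; toy_stack and the scope names are made up). The point is that if every iteration enters the same variable scope with reuse=tf.AUTO_REUSE, the "new" layers created in the loop all resolve to one set of variables:

```python
# Toy illustration, not the repo's code: toy_stack and the scope names are
# made up. It shows how a TF1 loop can either create separate variables per
# layer or reuse one shared set via tf.variable_scope + reuse=tf.AUTO_REUSE.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # run this TF1-style graph code under TF2


def toy_stack(inputs, num_hidden_layers=12, share_parameters=True, hidden_size=128):
  output = inputs
  for layer_idx in range(num_hidden_layers):
    # With sharing, every iteration enters the *same* scope name, and
    # AUTO_REUSE hands back the variables created on the first pass, so
    # gradients from all layers flow into one set of weights.
    scope = "layer_shared" if share_parameters else "layer_%d" % layer_idx
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
      output = tf.layers.dense(
          output, units=hidden_size, activation=tf.nn.relu, name="ffn")
  return output


x = tf.placeholder(tf.float32, [None, 128])
_ = toy_stack(x, share_parameters=True)
print(len(tf.trainable_variables()))  # 2 (one kernel + one bias), not 2 * 12
```

So my question is essentially whether the loop in transformer_model() works like the shared branch above, or whether each layer really gets its own variables.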

Cheers!

eddie-scio commented 3 years ago

Yup, just noticed that the get_assignment_map_from_checkpoint logic looks like it was forked from the BERT-style initialization (with no cross-layer parameter sharing) rather than an ALBERT-style initialization. My inference graph / checkpoint still looks to be the appropriate size, though, so it seems there might be a training / serving skew.
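For anyone else tracing this, here is a rough sketch of what the assignment map does (illustrative; build_identity_assignment_map and init_from_ckpt are made-up helpers, not the repo's functions, though tf.train.init_from_checkpoint is the standard API they would feed):

```python
# Rough sketch of the assignment-map idea (helpers below are made up).
# BERT-style init maps every graph variable to an identically named checkpoint
# entry; with cross-layer sharing the shared weights live under one scope, so
# the map has to collapse many logical layers onto that single stored set.
import tensorflow.compat.v1 as tf


def build_identity_assignment_map(trainable_vars):
  """BERT-style: checkpoint name == graph name, one entry per variable."""
  return {var.name.split(":")[0]: var for var in trainable_vars}


def init_from_ckpt(init_checkpoint, trainable_vars):
  assignment_map = build_identity_assignment_map(trainable_vars)
  # Graph variables whose names don't line up with the checkpoint either
  # trigger an error or are left out of the map and keep their fresh
  # initialization -- the kind of training / serving skew suspected above.
  tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
```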