pglushkov opened 4 years ago
Yup, just noticed that the get_assignment_map_from_checkpoint logic looks like it was forked from the BERT-style initialization (with no cross-layer parameter sharing) rather than the ALBERT-style initialization. My inference graph / checkpoint still looks the appropriate size, though, so there may be a training / serving skew.
Hi all, this issue is not about a concrete bug I found in the code, but is more of a code-understanding question. If you find this type of question inappropriate here, please accept my apologies in advance.
The question is about parameter sharing. The original ALBERT paper states that all hidden FFN and attention layers share their parameters with each other. At inference time, when the network is initialized, I can sort of see this happening in run_pretraining.py, where the num_of_initialize_group variable is used and the same weights are presumably assigned to all layers within the same group. However, during training, when we run modeling.py::transformer_model(), we distinctly create new attention/dense layers in the loop for each of [1 ... num_hidden_layers]. Won't that force them to be optimized separately during training? If so, where does the 'parameter sharing' part come into the picture? Thank you for your answers!
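To make my confusion concrete, here is a toy numpy sketch (not the repo's actual code, and the names are made up) of what I understand cross-layer parameter sharing to mean during training: a single weight matrix is applied at every layer, so each layer's backprop contribution accumulates into that one parameter rather than into per-layer copies.

```python
import numpy as np

# Toy sketch (NOT the repo's code) of cross-layer parameter sharing:
# ONE weight matrix W is applied at every "layer", so during training
# each layer's gradient contribution accumulates into the same
# parameter -- the layers are never optimized separately.

rng = np.random.default_rng(0)
num_layers = 3
W = rng.normal(size=(4, 4)) * 0.1   # the single shared parameter

x = rng.normal(size=(2, 4))

# Forward pass: reuse the same W in every loop iteration.
acts = [x]
for _ in range(num_layers):
    acts.append(np.tanh(acts[-1] @ W))
loss = 0.5 * np.sum(acts[-1] ** 2)

# Backward pass: the per-layer gradients w.r.t. W are summed into one
# gradient, because there is only one W in the model.
grad_W = np.zeros_like(W)
g = acts[-1]                          # dLoss/d(final activation)
for l in reversed(range(num_layers)):
    g = g * (1.0 - acts[l + 1] ** 2)  # through tanh (tanh' = 1 - tanh^2)
    grad_W += acts[l].T @ g           # this layer's contribution
    g = g @ W.T                       # propagate to the previous layer

print(grad_W.shape)   # a single (4, 4) gradient for the shared matrix
```

If I understand correctly, in the TF1-style code this reuse would have to come from the variable scoping (the loop fetching the same variables instead of creating new ones each iteration), which is exactly the part I can't see in transformer_model().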
Cheers!