pglushkov opened 4 years ago
Yup, just noticed that the get_assignment_map_from_checkpoint logic looks like it was forked from the BERT-style initialization (with no cross-layer parameter sharing) rather than the ALBERT-style initialization. My inference graph / checkpoint still looks the appropriate size, though, so there may be a training / serving skew.
Hi all, this issue is not about a concrete bug I found in the code, but is more of a code-understanding question. If you find this type of question inappropriate here, please accept my apologies in advance.
The question is about parameter sharing. The original ALBERT paper states that all hidden FFN and attention layers share their parameters with each other. At inference time, when the network is initialized, I can sort of see this happening in run_pretraining.py, where the num_of_initialize_group variable is used and the same weights are presumably assigned to all layers within the same group. However, during training, when we run modeling.py::transformer_model(), we distinctly create new attention/dense layers in the loop for each of [1 ... num_hidden_layers]. Won't that force them to be optimized separately during training? If so, where does the 'parameter sharing' part come into the picture? Thank you for your answers!
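To make my confusion concrete, here is a toy numpy sketch (not the repo's actual code, and the names are made up) of what I understand cross-layer parameter sharing to mean during training: a single weight matrix is applied at every layer, so each layer's backprop contribution accumulates into that one parameter rather than into per-layer copies.

```python
import numpy as np

# Toy sketch (NOT the repo's code) of cross-layer parameter sharing:
# ONE weight matrix W is applied at every "layer", so during training
# each layer's gradient contribution accumulates into the same
# parameter -- the layers are never optimized separately.

rng = np.random.default_rng(0)
num_layers = 3
W = rng.normal(size=(4, 4)) * 0.1   # the single shared parameter

x = rng.normal(size=(2, 4))

# Forward pass: reuse the same W in every loop iteration.
acts = [x]
for _ in range(num_layers):
    acts.append(np.tanh(acts[-1] @ W))
loss = 0.5 * np.sum(acts[-1] ** 2)

# Backward pass: the per-layer gradients w.r.t. W are summed into one
# gradient, because there is only one W in the model.
grad_W = np.zeros_like(W)
g = acts[-1]                          # dLoss/d(final activation)
for l in reversed(range(num_layers)):
    g = g * (1.0 - acts[l + 1] ** 2)  # through tanh (tanh' = 1 - tanh^2)
    grad_W += acts[l].T @ g           # this layer's contribution
    g = g @ W.T                       # propagate to the previous layer

print(grad_W.shape)   # a single (4, 4) gradient for the shared matrix
```

If I understand correctly, in the TF1-style code this reuse would have to come from the variable scoping (the loop fetching the same variables instead of creating new ones each iteration), which is exactly the part I can't see in transformer_model().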
Cheers!