Thank you very much for the article. Afterwards, I wanted to understand BERT more deeply and noticed the following in your code.
For fine-tuning, you use the following lines of code:
trainable_vars = self.bert.variables
trainable_vars = trainable_vars[-self.n_fine_tune_layers:]
However, self.bert.variables returns the list sorted by variable name, so the 11th transformer block comes before the 9th. As a result, during fine-tuning the intermediate layers are trained while the last ones stay completely frozen.
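To see the ordering problem in isolation, here is a small sketch (plain Python, no TensorFlow needed); the "layer_<k>" pattern below only mimics the usual BERT checkpoint naming, so the exact strings are an assumption rather than the module's real output:

# Illustrative only: per-block variable names of a 12-layer BERT encoder.
names = ["bert/encoder/layer_%d/output/dense/kernel:0" % k for k in range(12)]

# Name order puts layer_10 and layer_11 right after layer_1, so the tail
# of the name-sorted list never reaches the last transformer blocks.
print(sorted(names)[-3:])
# -> the layer_7, layer_8 and layer_9 kernels; layer_10 and layer_11 sort earlier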
bert.variables returns:
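Whatever the exact listing looks like, one possible workaround (only a sketch, assuming the standard "encoder/layer_<k>/" naming; the names in the hub module may differ) is to pick the trainable variables by their numeric block index instead of slicing the name-sorted list:

import re

def pick_top_layer_vars(variables, n_fine_tune_layers, num_layers=12):
    # Select variables whose transformer block index falls in the last
    # n_fine_tune_layers blocks, regardless of their position in the list.
    wanted = set(range(num_layers - n_fine_tune_layers, num_layers))
    picked = []
    for var in variables:
        match = re.search(r"encoder/layer_(\d+)/", var.name)
        if match and int(match.group(1)) in wanted:
            picked.append(var)
    return picked

# e.g. trainable_vars = pick_top_layer_vars(self.bert.variables, self.n_fine_tune_layers)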