google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How to freeze layers of bert? #637

Open shimafoolad opened 5 years ago

shimafoolad commented 5 years ago

How can I freeze all layers of BERT and only train the task-specific layers during fine-tuning? In pytorch-pretrained-BERT we can do it by setting requires_grad=False for all BERT parameters, but is there a way to do it in the TensorFlow code? I added the code below to the create_optimizer function in optimization.py:

tvars = tf.trainable_variables()
tvars = [v for v in tvars if 'bert' not in v.name]   ## my code (freeze all layers of bert)
grads = tf.gradients(loss, tvars)

Is that correct?
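
(For context, here is a minimal, self-contained TF 1.x toy check, not BERT code itself, that illustrates why filtering the variable list before tf.gradients freezes the excluded variables; the variable names are made up for the example:)

    import tensorflow as tf

    # Two scopes stand in for BERT and the task-specific head.
    with tf.variable_scope("bert"):
        frozen = tf.get_variable("w_frozen", initializer=1.0)
    with tf.variable_scope("classifier"):
        task = tf.get_variable("w_task", initializer=1.0)

    loss = tf.square(2.0 * frozen + 3.0 * task - 1.0)

    tvars = tf.trainable_variables()
    tvars = [v for v in tvars if "bert" not in v.name]  # drop everything under "bert"
    grads = tf.gradients(loss, tvars)
    train_op = tf.train.GradientDescentOptimizer(0.1).apply_gradients(zip(grads, tvars))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)
        # "bert/w_frozen" is unchanged (1.0); only "classifier/w_task" moves.
        print(sess.run([frozen, task]))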

hsm207 commented 5 years ago

Yes.

You can also double check that this is correct by looking at the log while training. At the start of training, there is a part that lists the layers that are going to get trained.

hkvision commented 5 years ago

But it seems that the log printing the trainable variables comes before the optimizer is created, so with this approach all the layers are still printed in the log?

shimafoolad commented 5 years ago

@hkvision, that's right. I've also added the second line of the code below to the model_fn_builder function so that the BERT layers are not printed in the log.

    tvars = tf.trainable_variables()
    tvars = [v for v in tvars if 'bert' not in v.name]         #my code 
    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
      (assignment_map,
       initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(
           tvars, init_checkpoint)
      if use_tpu:
        def tpu_scaffold():
          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
          return tf.train.Scaffold()
        scaffold_fn = tpu_scaffold
      else:
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
    tf.logging.info("**** Trainable Variables ****")
    for var in tvars:
      init_string = ""
      if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
      tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                      init_string)

shimafoolad commented 5 years ago

@hsm207, the optimizer is created (the following code) after that log is printed.

train_op = optimization.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 

What's more, the BERT trainable variables are still in the graph; with the following code their gradients are simply not computed. I'm not sure whether the BERT layers are really frozen this way. Should I delete them from the graph?

tvars = tf.trainable_variables()
tvars = [v for v in tvars if 'bert' not in v.name]   ## my code (freeze all layers of bert)
grads = tf.gradients(loss, tvars)
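
(You don't need to delete them; a variable that never reaches the optimizer simply keeps its restored value. A hedged sketch for double-checking, assuming the filter above sits inside create_optimizer: log the filtered list right before the gradients are built, so only the variables that will actually be updated appear.)

    tvars = tf.trainable_variables()
    tvars = [v for v in tvars if 'bert' not in v.name]

    # Only the variables that will actually receive gradient updates are listed;
    # the frozen BERT variables stay in the graph but never show up here.
    tf.logging.info("**** Variables being optimized ****")
    for var in tvars:
      tf.logging.info("  name = %s, shape = %s", var.name, var.shape)

    grads = tf.gradients(loss, tvars)
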
hkvision commented 5 years ago

I used this approach to train only the last fully connected layer in the run_classifier.py example on the MRPC dataset. But now the model doesn't seem to converge, and the eval accuracy is around 0.68, which is the same as the model without any training. Have you trained the model with frozen weights? Where am I going wrong? Thanks so much in advance!

hsm207 commented 5 years ago

@shimafoolad

I don't understand your question but check out my fork of BERT.

This is the part that makes sure only the layers added on top of BERT are updated during finetuning.

I've also written a script that compares the weights in two checkpoint files and prints the ones that differ. I finetuned BERT on CoLA and compared the checkpoint files at step 0 and step 267. As expected, only the weights associated with output_weights and output_bias differ:

image
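
(For readers who don't want to dig through the fork, a minimal sketch of such a comparison, assuming plain TF 1.x checkpoint prefixes such as model.ckpt-0 and model.ckpt-267; this is not the exact script from the fork:)

    import numpy as np
    import tensorflow as tf

    def diff_checkpoints(ckpt_a, ckpt_b):
        """Print the names of variables whose values differ between two checkpoints."""
        reader_a = tf.train.load_checkpoint(ckpt_a)
        reader_b = tf.train.load_checkpoint(ckpt_b)
        shared = (set(reader_a.get_variable_to_shape_map()) &
                  set(reader_b.get_variable_to_shape_map()))
        for name in sorted(shared):
            if not np.array_equal(reader_a.get_tensor(name), reader_b.get_tensor(name)):
                print("changed:", name)

    # e.g. diff_checkpoints("output/model.ckpt-0", "output/model.ckpt-267")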

I hope this answers your question.

@hkvision Try finetuning for more epochs and with a higher learning rate. I finetuned on the CoLA dataset using the default hyperparameters, and here are my results after 5 epochs:

image

This is what I get after 50 epochs and progressively increasing the learning rate by a few orders of magnitude:

image

hkvision commented 5 years ago

Hi @hsm207, thanks so much for your answer. I'm using the MRPC dataset and am not sure whether the result on CoLA carries over. For MRPC, training the whole of BERT easily reaches 84%-88% accuracy. Is that the case for CoLA as well? If so, freezing BERT seems to greatly impact the final accuracy?

hsm207 commented 5 years ago

@hkvision On CoLA, I can reach an accuracy of around 83% on the dev set using BERT base and finetuning in the usual way.

I'm not sure about the impact on the final accuracy. Since freezing BERT leaves only a very limited number of parameters that can be updated, it will take longer to converge. Whether finetuning a frozen BERT will eventually reach the same or a better result than not freezing it, I am not sure.

Just curious, what was your objective in finetuning only BERT's final dense layer?

hkvision commented 5 years ago

@hsm207 Thanks so much! I'm just running some experiments of my own to see the impact of freezing BERT. Since I'm running BERT on CPU, freezing it is much faster... I understand that with BERT frozen the number of trainable parameters is really limited; at the very least, it will take more effort to make the result comparable with not freezing BERT.

biuleung commented 5 years ago

@hsm207 @shimafoolad

I ran the pre-training process for the model (BERT-Base, Uncased) and got a checkpoint file, a graph.pbtxt, and a series of model.ckpt files, so everything was fine for the pre-training process. As the next step, I am now trying to fine-tune this self-pre-trained model, but I am getting some error messages.

I0618 02:19:33.390300 139792308799360 saver.py:1280] Restoring parameters from gs://pretrainingpatent001/20190617_english_test/pretraining_output/model.ckpt-0
2019-06-18 02:19:34.437859: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key output_bias not found in checkpoint
E0618 02:19:34.543154 139792308799360 error_handling.py:70] Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
(0) Not found: Key output_bias not found in checkpoint
  [[node save/RestoreV2 (defined at run_classifier.py:880) ]]
  [[save/RestoreV2/_301]]
(1) Not found: Key output_bias not found in checkpoint
  [[node save/RestoreV2 (defined at run_classifier.py:880) ]]
0 successful operations. 0 derived errors ignored.

It said that the key "output_bias" could not be found in the checkpoint, so I checked the run_pretraining.py code, which says:

    # Simple binary classification. Note that 0 is "next sentence" and 1 is
    # "random sentence". This weight matrix is not used after pre-training.
    with tf.variable_scope("cls/seq_relationship"):
      output_weights = tf.get_variable(
          "output_weights",
          shape=[2, bert_config.hidden_size],
          initializer=modeling.create_initializer(bert_config.initializer_range))
      output_bias = tf.get_variable(
          "output_bias", shape=[2], initializer=tf.zeros_initializer())

It looks like either there is one extra layer on top of the model that run_classifier.py doesn't need, or graph.pbtxt and the checkpoint are inconsistent.

So my question is: how do I avoid these errors? I just want to fine-tune my self-pre-trained model.

I will be grateful for any help you can provide.

hsm207 commented 5 years ago

@biuleung I've never fine-tuned a self-pre-trained model, so I'm not sure if this is a bug in the original implementation or an error coming from your own modifications. A good starting point for solving your problem is to understand how checkpoints work. This guide is a great resource for that.
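
(One quick way to see which keys a checkpoint actually contains, and under which scopes, is tf.train.list_variables; the path below is a placeholder:)

    import tensorflow as tf

    # Placeholder path; point it at your own model.ckpt-* prefix.
    ckpt = "pretraining_output/model.ckpt-0"

    # Prints every variable name and shape stored in the checkpoint, e.g. you can
    # check whether it has "cls/seq_relationship/output_bias" or a bare "output_bias".
    for name, shape in tf.train.list_variables(ckpt):
        print(name, shape)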

Have you tried the PyTorch implementation?

biuleung commented 5 years ago

Thanks for your reply and the recommendation! I simply ran the original code (run_pretraining.py) and fine-tuned the self-pre-trained model with the original run_classifier.py. I found that the pre-trained model has extra layers (cls/seq_relationship and cls/predictions) on top of the model, so I think that is why it reported the name-scope issue. I then added a name scope statement before the declaration of output_bias in run_classifier.py. It runs, but the prediction results were weird.

nlp4whp commented 5 years ago

(quoting @hsm207's reply to @shimafoolad and @hkvision above)

Thanks a ton!! Really helpful.

OYE93 commented 5 years ago

(quoting @hsm207's reply to @shimafoolad and @hkvision above)

Hi @hsm207, I have a question about how to freeze only some of the layers in BERT. For example, I just want to freeze the parameters before layer 11. How can I do that? Thanks. :)

hsm207 commented 5 years ago

@OYE93

Have a look at this line.

tvars now contains a list of all the weights outside BERT. You will need to add to it the parameters from layer 11 onwards. You can check the checkpoint files to see how these weights are named.
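
(A hedged sketch of that selection, assuming the standard BERT-Base naming, i.e. encoder variables under bert/encoder/layer_0 through bert/encoder/layer_11; verify the exact names against your own checkpoint, e.g. with tf.train.list_variables:)

    import re
    import tensorflow as tf

    def trainable_vars_from_layer(min_layer=11):
        """Keep the task head plus BERT encoder layers >= min_layer; leave the rest frozen."""
        layer_re = re.compile(r"^bert/encoder/layer_(\d+)/")
        keep = []
        for var in tf.trainable_variables():
            if "bert" not in var.name:
                keep.append(var)  # layers added on top of BERT
                continue
            m = layer_re.match(var.name)
            if m and int(m.group(1)) >= min_layer:
                keep.append(var)  # e.g. layer_11 in a 12-layer BERT-Base
            # bert/embeddings, bert/pooler and earlier encoder layers are left out,
            # so no gradients are computed or applied for them.
        return keep

    # Usage inside create_optimizer, replacing the plain tf.trainable_variables() call:
    #   tvars = trainable_vars_from_layer(min_layer=11)
    #   grads = tf.gradients(loss, tvars)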