Learning Rate fine-tuning

MeTaNoV commented 6 years ago

Hello,

I would like to experiment with TF Hub to retrain an image classifier. The retrain example is a good starting point for that purpose. In the example, you have the ability to fix a learning rate for the final layer. Also, using hub.Module(..., trainable=True), you can also let the pre-trained weights to be updated. My question is: which learning rate will be applied in that case (inherited from the one specified on the final layer?), and how to change it if possible, and use a different one from the one in the final layer.

Thanks in advance!

arnoegw commented 6 years ago

Hi Pascal,

I'm afraid you will have to move past the retrain.py example, because it really is designed around a frozen image module with potentially cached bottleneck values. Even if not cached, the way they are passed into the Session.run() call that does the training prevents backprop through the module.

There is no publicly available example, but I can offer some general advice:

For the training graph, use hub.Module(..., trainable=True, tags={"train"}) to get the graph version that operates batch norm in training mode. (In retrain.py, this would conflict with the use of continuous eval in the training graph.) Also, trainable=True brings in REGULARIZATION_LOSSES from the hidden layers (if any); use those when you see overfitting.
Use data augmentation. For that, retrain.py is a workable starting point.
Fine-tuning the whole network benefits from more sophisticated approaches than plain SGD, esp. the use of momentum and learning rate decay. As a rule of thumb, I'd recommend to start form the training regime in the architecture's original paper (referenced from the module documentation), with the initial learning rate cut by 10. Be aware that many uses of RMSProp for image models set epsilon=1.0 (which is huge), and attenuates the AdaGrad-style scaling.
For performance, consider using a GPU, and/or multiple machines. If you want to try in-graph replication for multiple GPUs, be sure to reuse one Module object across the different towers, in order to share variables. Also, for performance, consider training only the upper layers of the module (e.g., by filtering trainable variables according to scope names).

Happy coding/training!

Arno

On 13 April 2018 at 17:24, Pascal Gula notifications@github.com wrote:

Hello, I would like to experiment with TF Hub to retrain an image classifier. The retrain example is a good starting point for that purpose. In the example, you have the ability to fix a learning rate for the final layer. Also, using hub.Module(..., trainable=True), you can also let the pre-trained weights to be updated. My question is: which learning rate will be applied in that case, and how to change it if possible? Thanks in advance!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tensorflow/hub/issues/24, or mute the thread https://github.com/notifications/unsubscribe-auth/AjkBVQSbLN4jZknpwOY2bsBxBr_McYxGks5toMNCgaJpZM4TTo_d .

-- Google Switzerland GmbH

moono commented 6 years ago

Hello, @arnoegw,

I have question about usage of REGULARIZATION_LOSSES. Is it like following? Am I on right track?

...

module = hub.Module('...', trainable=True, tags={'train'})
start_from = module(inputs)
logits = tf.layers.dense(start_from, units=n_output_class, activation=None)
...

loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss += tf.add_n(reg_losses)

arnoegw commented 6 years ago

Hi @moono

Yup, this looks right (for training -- be sure to not set tags={'train'} during eval or inference).

TensorFlow offers some syntactic sugar for getting the regularization losses: tf.losses.get_regularization_losses() for the list, tf.losses.get_regularization_loss() for its sum.

moono commented 6 years ago

@arnoegw

I was seeing much higher train accuracy than evaluate accuracy. But as you mentioned, removing tags={"train"} fixed the issue. Thank you so much :)

alabatie commented 6 years ago

Hello @arnoegw and @moono,

I am a colleague of @MeTaNoV. As suggested by his question, we are currently trying to fine-tune a pretrained inception-v3 from TF Hub on our specific classification task. Our first goal (that we haven't yet achieved) is simply to reproduce the results previously obtained with the Caffe framework.

Following your response, we implemented a train graph that instantiates the TF module with hub.Module(module_spec, trainable=True, tags={"train"}) and a test graph that instantiates the TF module with hub.Module(module_spec). As in our Caffe implementation, we reduce the learning rate for the convolutional layers by a factor 10 compared to the final classification layer, using the following trick before the classification layer:cnn_output_tensor = 1/10. * cnn_output_tensor + (1-1/10.0) * tf.stop_gradient(cnn_output_tensor)

An additional important problem was related to the batch normalization layers. In order to work correctly at test time, the moving averages for the batch means and variances need to be updated during training. It seems that these updates are not done by default, which requires to either perform manually the update_op or to include it in a control dependency. Here is what we implemented to automatically perform the updates: update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) updates = tf.group(*update_ops) cross_entropy = control_flow_ops.with_dependencies([updates], cross_entropy)

Even with this implementation, we still don't manage to reproduce our previous results with Caffe.

On today I implemented to fetch variables from batch normalization layers, and to write their histograms in summaries for tensorboard visualization. The visualization shows that moving averages are indeed updated during training, but it also shows that beta variables seem to be fixed throughout training.

I understand that gamma variables are not present since they are redundant with the next convolutional layers in case of ReLU activation. However I would expect beta variables to be very important before a ReLU activation. And I would expect that the normalization effect of batch normalization layers combined with non-trainable beta variables is very detrimental (from our tests, it seems we loose ~4% in our final top-1 accuracy). Is this analysis correct ? Would you have a fix for this ?

Thank you very much in advance. Antoine

arnoegw commented 6 years ago

Hi Antoine,

what you describe is a general TensorFlow subtlety about running UPDATE_OPS. As far as I know, it's all the same whether they come out of a TensorFlow Hub module, or directly from Python code using batch normalization.

Usually, training is done with a train_op that combines the gradient updates from the optimizer with the elements of the UPDATE_OPS collection. The helper function tf.contrib.training.create_train_op does that by returning a train_op that is the total_loss with control_dependencies on both the update_ops and the grad_updates.

I recommend to do something similar in your code.

Just putting a control dependency on the loss does not automatically put a control dependency on its gradient; cf. the final example snippet in the API docs for tf.Graph.control_dependencies.

I agree that not running UPDATE_OPS to keep the moving averages of batch norm for inference in sync with the per-batch statistics seen during training (or fine-tuning) will likely cause a serious degradation of quality.

Hope that helps, Arno

alabatie commented 6 years ago

Thank you very much Arno for the quick answer.

I don't think there's a problem with our moving average updates, since we can now visualize these variables evolving during the training.

What concerned me was beta variables that didn't seem to be updated. However I managed to spot slight variations of beta (probably only slight due to how small we set the learning rate in the module part) in my latest visualizations: https://screenshots.firefox.com/Gc0s298lAiaIFpIP/localhost

This means that these layers are correctly trained. Thus we are still wondering why we can't reproduce the results we obtained with Caffe.

arnoegw commented 6 years ago

Hi Antoine, I'm glad you could clear the UPDATE_OPS issue (If you evaluate cross_entropy at every step for loss reporting, your code will work, albeit not from backprop alone.), and also the training of beta (batch norm's learned output mean).

Are you still seeing a difference to Caffe? That's such a wide question, it's hard for me to answer. REGULARIZATION_LOSSES of the module were already discussed upthread. There might also be differences in regularizing the classifier you put on top (dropout?, weight decay?), data augmentation, the optimizer and its learning rate schedule, Polyak averaging of the model weights, ...

rsethur commented 6 years ago

Hello @alabatie can you share your findings, please? I'm leveraging TF Hub as well and would appreciate your findings.

himaprasoonpt commented 5 years ago

If I am using a hub module as follows module = hub.Module('...', trainable=True, tags={'train'}) module_out = module(input) layer2 = somelayer(moduleout) Defines losses and optimizer After training is complete, when I run layer2(final layer) in infer mode, should I change the module tag? If yes how can I do that? Should I be using some sort of placeholder to switch tag? The batch norm mode has to be changed right? @arnoegw

arnoegw commented 5 years ago

Hello @himaprasoonpt, please see this StackOverflow answer answer. (In short: Solutions differ for TF1 and TF2. In TF1, you'd need to checkpoint weights and restore into a new graph built with switched tags.)

tensorflow / hub

Learning Rate fine-tuning #24