tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Simple transfer learning #1469

Open goodmansasha opened 5 years ago

goodmansasha commented 5 years ago

Description

I'm attempting to do very basic transfer learning with a Transformer, and I'm asking whether someone could point me towards an example of how to do that in tensor2tensor.

I've seen Radford et al.'s work ( https://blog.openai.com/language-unsupervised/ ), which is inspiring. However, I see no hyperparameters in tensor2tensor that simply freeze certain Keras layers by setting trainable=False.

The basic idea is to pre-train a model on a lower-quality dataset generated by another computer program (a "silver" dataset), and then to fine-tune that model on a hand-curated "gold" dataset created by a human expert. The silver dataset is much larger than the gold one, but the gold data takes a long time to produce. After the "silver model" has been trained on the silver dataset, most of its layers would be frozen except the final layer, and the model would then be trained on the gold data. That way, the model does not forget what it learned from the silver dataset. I tried doing this without freezing layers, and unfortunately the model started to degrade and lose what it had learned from the silver data.
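For reference, the pattern I have in mind is easy to express in plain tf.keras, outside of tensor2tensor. This is only an illustration of the silver/gold idea; the data, layer sizes, and training settings below are made up:

import numpy as np
import tensorflow as tf

# Stand-in data for the "silver" (machine-generated) and "gold" (hand-curated)
# datasets; shapes and sizes are invented for illustration.
silver_x = np.random.rand(1000, 128).astype(np.float32)
silver_y = np.random.randint(10, size=1000)
gold_x = np.random.rand(100, 128).astype(np.float32)
gold_y = np.random.randint(10, size=100)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # final task layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# 1) Pre-train on the large silver dataset.
model.fit(silver_x, silver_y, epochs=5)

# 2) Freeze everything except the final layer, then re-compile so the
#    change takes effect.
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# 3) Fine-tune on the small gold dataset; the frozen layers keep what they
#    learned from the silver data.
model.fit(gold_x, gold_y, epochs=20)

Re-compiling after toggling trainable is what makes the change take effect in Keras. I don't see an equivalent switch exposed in tensor2tensor, which is what this issue is asking about.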

...

Environment information

OS: Ubuntu 18.04.1 LTS

$ pip3 freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.12.0
tensorboard==1.12.2
tensorflow==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0

$ python3 -V
Python 3.6.7

JohannesTK commented 5 years ago

I don't know what problem you are trying to solve, but in general, to achieve good results with transfer learning in Tensor2Tensor:

  1. Train a language model, e.g. on LanguagemodelWikitext103 or LanguagemodelEnWiki64k. The latter will result in a bigger model.
  2. Make sure to use the correct hyperparameter set for training the LM, such as transformer_tall_pretrain_lm.
  3. Generate your final problem's dataset using the LM vocabulary (see the example of how to do it).
  4. Train your final problem with a warm start from the LM checkpoint using --warm_start_from=/path/to/lm/checkpoint (a rough sketch of what this step amounts to is below).
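I'm not certain exactly how t2t wires up --warm_start_from internally, but conceptually step 4 amounts to TensorFlow's Estimator warm-start mechanism, roughly like this (the model_fn and the output path are placeholders; t2t normally builds all of this for you):

import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Placeholder: t2t constructs the real Transformer model_fn itself.
    raise NotImplementedError

# Initialize every matching variable in the fine-tuning graph from the
# pre-trained LM checkpoint instead of from scratch.
warm_start = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from="/path/to/lm/checkpoint",
    vars_to_warm_start=".*",
)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="/path/to/finetune/output",
    warm_start_from=warm_start,
)

Note that warm-starting only initializes variables from the checkpoint; nothing here freezes them, so all layers keep training during fine-tuning.
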
stefan-falk commented 4 years ago

@JohannesTK do you have any detailed information about what actually causes layers to be frozen? Looking at the hparams, I only see parameters related to multi-problem training, but nothing that actually freezes parts of a model after pretraining:

@registry.register_hparams
def transformer_tall_finetune_textclass():
  """Hparams for transformer on LM for finetuning on text class problems."""
  hparams = transformer_tall()
  hparams.learning_rate_constant = 6.25e-5
  hparams.learning_rate_schedule = ("linear_warmup*constant*linear_decay")
  hparams.multiproblem_schedule_max_examples = 0
  hparams.multiproblem_target_eval_only = True
  hparams.learning_rate_warmup_steps = 50
  # Set train steps to learning_rate_decay_steps or less
  hparams.learning_rate_decay_steps = 25000
  hparams.multiproblem_reweight_label_loss = True
  hparams.multiproblem_label_weight = 0.95
  return hparams

I am currently experimenting with multi-problem training myself (see https://github.com/tensorflow/tensor2tensor/issues/1687), but, like @goodmansasha, I would like to know how to freeze e.g. the entire encoder of the Transformer, or whether this is something that still has to be done manually in t2t.

Can you provide any information on that?
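The closest workaround I can think of is to do it by hand in plain TF 1.x: exclude the frozen variables from the var_list passed to the optimizer. This is only a sketch, not t2t's API; the encoder scope name is my guess, and it would mean customizing the training op rather than flipping an hparam:

import tensorflow as tf

def build_train_op(loss, learning_rate=1e-4,
                   frozen_scope="transformer/body/encoder"):
    # Exclude every variable under the frozen scope from optimization;
    # variables that receive no gradient updates keep their pre-trained values.
    train_vars = [v for v in tf.trainable_variables()
                  if frozen_scope not in v.name]
    optimizer = tf.train.AdamOptimizer(learning_rate)
    return optimizer.minimize(loss, var_list=train_vars)
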

gabegrand commented 4 years ago

@stefan-falk were you able to figure out how to freeze layers in T2T? I have a similar scenario: I have a pretrained Transformer, and I'd like to freeze the base layers during fine-tuning. It doesn't look like T2T supports this out of the box, but maybe you found a way to accomplish it manually?

stefan-falk commented 4 years ago

@gabegrand Unfortunately no. I had to abandon this approach for now.