tensorflow / models

Models and examples built with TensorFlow

Not able to replicate NHNet results & FP16 not working for it. #9262

Closed nlp-sudo closed 3 years ago

nlp-sudo commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/nlp/nhnet

2. Describe the bug

I tried to replicate the NHNet results reported in the paper, but the best accuracy I could reach was around 30%. The only thing I did differently from the steps in the repo was to use a batch size of 8 instead of 16 for GPU training; that alone should not cause such a large gap in accuracy. To be on the safe side, I crawled the data for more than 5 days. I am also trying to use mixed precision so that I can fit a larger batch size, but when I wrap the optimizer with tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic") I get the following error: AttributeError: 'LossScaleOptimizer' object has no attribute '_hypers_created'

I updated the train_step function based on the guidelines here: https://www.tensorflow.org/guide/mixed_precision?hl=en#training_the_model_with_a_custom_training_loop
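
For reference, the setup described in that guide has two parts: a global dtype policy set before the model is built, and a loss-scaling wrapper around the optimizer. A minimal sketch using the TF 2.3-era experimental API (the optimizer here is just a placeholder; my trainer changes below only add the optimizer wrapping):

    import tensorflow as tf

    # Set the global policy before any layers/models are constructed, so they
    # are created with float16 compute and float32 variables.
    tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

    # Wrap any Keras optimizer to get dynamic loss scaling.
    opt = tf.keras.optimizers.Adam()
    opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")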

3. Steps to reproduce

Update the trainer.py file with this:
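
# Excerpt of my modified official/nlp/nhnet/trainer.py; the module's original
# imports and the functions not shown here are assumed unchanged.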

class Trainer(tf.keras.Model):
  """A training only model."""

  def __init__(self, model, params):
    super(Trainer, self).__init__()
    self.model = model
    self.params = params
    self._num_replicas_in_sync = tf.distribute.get_strategy(
    ).num_replicas_in_sync

  def call(self, inputs, mode="train"):
    return self.model(inputs, mode)

  def train_step(self, inputs):
    """The logic for one training step."""
    with tf.GradientTape() as tape:
      logits, _, _ = self(inputs, mode="train", training=True)
      targets = models.remove_sos_from_seq(inputs["target_ids"],
                                           self.params.pad_token_id)
      loss = transformer_metrics.transformer_loss(logits, targets,
                                                  self.params.label_smoothing,
                                                  self.params.vocab_size)
      # Scale the loss for mixed-precision loss scaling, then divide by the
      # number of replicas so backprop uses the average loss across replicas.
      scaled_loss = self.optimizer.get_scaled_loss(
          loss) / self._num_replicas_in_sync

    tvars = self.trainable_variables
    grads = self.optimizer.get_unscaled_gradients(tape.gradient(scaled_loss, tvars))
    self.optimizer.apply_gradients(list(zip(grads, tvars)))
    return {
        "training_loss": loss,
        "learning_rate": self.optimizer._decayed_lr(var_dtype=tf.float32)
    }

def train(params, strategy, dataset=None):
  """Runs training."""

  if not dataset:
    dataset = input_pipeline.get_input_dataset(
        FLAGS.train_file_pattern,
        FLAGS.train_batch_size,
        params,
        is_training=True,
        strategy=strategy)

  with strategy.scope():
    model = models.create_model(
        FLAGS.model_type, params, init_checkpoint=FLAGS.init_checkpoint)
    opt = optimizer.create_optimizer(params)
    opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")
    trainer = Trainer(model, params)
    model.global_step = opt.iterations

    trainer.compile(
        optimizer=opt,
        experimental_steps_per_execution=FLAGS.steps_per_loop)
    summary_dir = os.path.join(FLAGS.model_dir, "summaries")
    summary_callback = tf.keras.callbacks.TensorBoard(
        summary_dir, update_freq=max(100, FLAGS.steps_per_loop))
    checkpoint = tf.train.Checkpoint(model=model, optimizer=opt)
    checkpoint_manager = tf.train.CheckpointManager(
        checkpoint,
        directory=FLAGS.model_dir,
        max_to_keep=10,
        step_counter=model.global_step,
        checkpoint_interval=FLAGS.checkpoint_interval)
    if checkpoint_manager.restore_or_initialize():
      logging.info("Training restored from the checkpoints in: %s",
                   FLAGS.model_dir)
    checkpoint_callback = keras_utils.SimpleCheckpoint(checkpoint_manager)

  # Trains the model.
  steps_per_epoch = min(FLAGS.train_steps, FLAGS.checkpoint_interval)
  epochs = FLAGS.train_steps // steps_per_epoch
  history = trainer.fit(
      x=dataset,
      steps_per_epoch=steps_per_epoch,
      epochs=epochs,
      callbacks=[summary_callback, checkpoint_callback],
      verbose=2)
  train_hist = history.history
  # Gets final loss from training.
  stats = dict(training_loss=float(train_hist["training_loss"][-1]))
  return stats

4. Expected behavior

Training should start as it did before enabling FP16.

5. Additional context

Error message when I run the code above: AttributeError: 'LossScaleOptimizer' object has no attribute '_hypers_created'

It also prints some warnings like these:

W0917 14:16:15.056954 140018629465920 ag_logging.py:146] AutoGraph could not transform <bound method TransformerDecoder.call of <official.nlp.nhnet.decoder.TransformerDecoder object at 0x7f573c702350>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method TransformerDecoderLayer.call of <official.nlp.modeling.layers.transformer.TransformerDecoderLayer object at 0x7f573c67f750>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0917 14:16:15.077228 140018629465920 ag_logging.py:146] AutoGraph could not transform <bound method TransformerDecoderLayer.call of <official.nlp.modeling.layers.transformer.TransformerDecoderLayer object at 0x7f573c67f750>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method CachedAttention.call of <official.nlp.modeling.layers.attention.CachedAttention object at 0x7f573c63ead0>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0917 14:16:15.085375 140018629465920 ag_logging.py:146] AutoGraph could not transform <bound method CachedAttention.call of <official.nlp.modeling.layers.attention.CachedAttention object at 0x7f573c63ead0>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method MultiChannelAttention.call of <official.nlp.modeling.layers.multi_channel_attention.MultiChannelAttention object at 0x7f573c60af90>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0917 14:16:15.138240 140018629465920 ag_logging.py:146] AutoGraph could not transform <bound method MultiChannelAttention.call of <official.nlp.modeling.layers.multi_channel_attention.MultiChannelAttention object at 0x7f573c60af90>> and will run it as-is.

6. System information

saberkun commented 4 years ago

Thanks for reporting the problems.

(1) What GPU command line are you using? Training is very expensive and the paper used TPUs: https://github.com/tensorflow/models/tree/master/official/nlp/nhnet#tpu. The GPU command line in the README is just a toy example.

(2) Regarding the loss-scale optimizer, overriding keras.Model has complex interactions; how best to configure this may be a design problem we still need to figure out. @reedwm

nlp-sudo commented 4 years ago

I used the same command line as in the README (I only reduced train_batch_size from 16 to 8). I also tried changing steps_per_loop from 1 to 10000 once, but that didn't help either.
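
For reference, that invocation was along these lines (only the flags that the train() code above actually reads are shown; the exact way trainer.py is launched, the paths, and the remaining values are illustrative placeholders):

    python3 trainer.py \
      --train_file_pattern=/path/to/train_data* \
      --model_dir=/path/to/model_dir \
      --model_type=nhnet \
      --init_checkpoint=/path/to/bert_checkpoint \
      --train_batch_size=8 \
      --train_steps=10000 \
      --steps_per_loop=1 \
      --checkpoint_interval=2000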

Can you let me know how I can train the model to comparable quality on GPU? And if that's not possible, could a pre-trained model checkpoint be made available?

saberkun commented 4 years ago

steps_per_loop only controls the inner loop running on the device; it does not affect the total number of steps used to train the model. The TPU command line is closer to the paper's setting.
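
To make that concrete, the relevant lines of the train() code above already pin down the total step count (restated here with comments, no change in behavior):

    # steps_per_loop is only passed to compile(experimental_steps_per_execution=...);
    # it batches steps per device loop but does not add or remove training steps.
    steps_per_epoch = min(FLAGS.train_steps, FLAGS.checkpoint_interval)
    epochs = FLAGS.train_steps // steps_per_epoch
    # trainer.fit() therefore runs roughly steps_per_epoch * epochs
    # ~= FLAGS.train_steps optimizer steps, regardless of steps_per_loop.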

I am afraid the checkpoint release has not been approved. @remenberl

remenberl commented 4 years ago

Thanks for reporting the problem.

Some reproduction details: we trained our headline model on TPUv3 with a batch size of 1024. More importantly, as reported in the paper, we have a pretraining stage that initializes the seq2seq model.

The gap is significant when the batch size is small and there is no pre-trained encoder-decoder. I would suggest pretraining the seq2seq model (BERT2BERT) first, before running NHNet.

nlp-sudo commented 4 years ago

Okay, I will pretrain a seq2seq model on a large corpus. I assume I should pass the saved checkpoint of the pre-trained seq2seq model as --init_checkpoint, but the paper says the pre-trained seq2seq model is used to initialize only the encoder module and the word embedding layer, not the decoder module. Do I need to handle that separately?

PS: As mentioned in the README, you used a TPU v3-64, which most researchers cannot get access to. It would therefore be great if a pre-trained checkpoint could be published.

remenberl commented 4 years ago

There are multiple stages needed to get the best performance, as shown in Figure 3; among them, single-doc pretraining is more crucial than multi-doc distant supervision, assuming you already have access to a BERT pretraining checkpoint.

Based on this, I suggest training a seq2seq model from a BERT pretraining checkpoint, and then training NHNet from that seq2seq checkpoint.
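
As a sketch of what that two-stage recipe could look like with the trainer flags shown earlier (assuming --model_type accepts a bert2bert seq2seq variant, since models.create_model is parameterized by FLAGS.model_type; all paths are placeholders):

    # Stage 1 (sketch): train the BERT2BERT seq2seq model from a BERT checkpoint.
    python3 trainer.py \
      --model_type=bert2bert \
      --init_checkpoint=/path/to/bert_checkpoint \
      --train_file_pattern=/path/to/seq2seq_train_data* \
      --model_dir=/path/to/bert2bert_model_dir

    # Stage 2 (sketch): train NHNet initialized from the stage-1 checkpoint.
    python3 trainer.py \
      --model_type=nhnet \
      --init_checkpoint=/path/to/bert2bert_model_dir/checkpoint \
      --train_file_pattern=/path/to/nhnet_train_data* \
      --model_dir=/path/to/nhnet_model_dir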

Unfortunately, the pre-trained checkpoint cannot be published due to company policy. For the same reason, the released dataset contains only URLs, not titles/bodies.

reedwm commented 4 years ago

The LossScaleOptimizer issue comes from the fact that we call the private _decayed_lr method on the optimizer, which is not supported on the LossScaleOptimizer:

    return {
        "training_loss": loss,
        "learning_rate": self.optimizer._decayed_lr(var_dtype=tf.float32)
    }

I am working on resolving this for TensorFlow 2.4. In the meantime, you can replace the above snippet with:

    if isinstance(self.optimizer,
                  tf.keras.mixed_precision.experimental.LossScaleOptimizer):
      inner_optimizer = self.optimizer._optimizer
    else:
      inner_optimizer = self.optimizer
    return {
        "training_loss": loss,
        "learning_rate": inner_optimizer._decayed_lr(var_dtype=tf.float32)
    }

Note this snippet relies on accessing the private _optimizer member of LossScaleOptimizer, so it may break in TF 2.4. But the code already relies on the private _decayed_lr method, so it may break in TF 2.4 anyway.

saberkun commented 4 years ago

_decayed_lr is probably not needed. We can use:

    if callable(self.optimizer.learning_rate):
      logs['learning_rate'] = self.optimizer.learning_rate(self.global_step)
    else:
      logs['learning_rate'] = self.optimizer.learning_rate
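
Adapted to the Trainer.train_step shown earlier, that would look roughly like this (a sketch; it assumes learning_rate is reachable through the LossScaleOptimizer wrapper, which I believe the experimental wrapper delegates to the inner optimizer):

    # Sketch: build the logged metrics without the private _decayed_lr call.
    lr = self.optimizer.learning_rate
    if callable(lr):
      # Learning-rate schedules are callables of the current step.
      lr = lr(self.optimizer.iterations)
    return {"training_loss": loss, "learning_rate": lr}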

nlp-sudo commented 4 years ago

Thanks for the input. I will try changing the loss computation like this. However, I think the dtype of most of the model layers is explicitly set to float32, so FP16 might not help. I will report my observations once I try it.
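
In case it helps with that check, a quick way to see which layers keep a float32 dtype (a rough sketch; it assumes the built model object, e.g. the one returned by models.create_model, is available as `model`, and simply walks its sub-layers):

    import tensorflow as tf

    # Layers constructed with an explicit dtype="float32" keep it even under a
    # mixed_float16 policy, so they will not benefit from FP16 compute.
    for layer in model.submodules:
      if isinstance(layer, tf.keras.layers.Layer):
        print(layer.name, layer.dtype)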
