Thanks for reporting the problems.
(1) What is your GPU command line? Training is very expensive, and the paper used TPUs: https://github.com/tensorflow/models/tree/master/official/nlp/nhnet#tpu The GPU command line in the README is just a toy example. (2) As for the loss scale optimizer, overriding keras.Model has complex interactions; how to configure this may be a design problem we need to figure out. @reedwm
I used the same command line mentioned in the README (I just reduced train_batch_size from 16 to 8). I also once tried changing steps_per_loop from 1 to 10000, but that didn't help either.
Can you let me know how I can train the model on GPU? And if that's not possible, can a pre-trained model checkpoint be released for use?
steps_per_loop only controls the inner loop running on the device; it does not affect the total number of steps used to train the model. The TPU command line is closer to the paper's setting.
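For intuition, here is a minimal sketch of that pattern (not the actual NHNet trainer; the function names are made up). steps_per_loop only decides how many steps run inside one compiled call before control returns to the host, while total_steps fixes the overall amount of training:

import tensorflow as tf

@tf.function
def inner_loop(iterator, train_step, steps_per_loop):
  # Runs `steps_per_loop` optimizer steps on the device without
  # returning to the Python host between steps.
  for _ in tf.range(steps_per_loop):
    train_step(next(iterator))

def fit(iterator, train_step, total_steps, steps_per_loop):
  # The same total_steps of work is done regardless of steps_per_loop;
  # only the host/device round-trip granularity changes.
  for _ in range(total_steps // steps_per_loop):
    inner_loop(iterator, train_step, steps_per_loop)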
I am afraid the checkpoint release is not approved. @remenberl
Thanks for reporting the problem.
Some reproduction details: we train our headline model on TPUv3 with a batch size of 1024. More importantly, we have a pretraining stage to initialize the seq2seq model, as reported in the paper.
The gap is significant if the batch size is small and there is no pre-trained encoder-decoder. I would suggest you pretrain the seq2seq model (BERT2BERT) first before running NHNet.
Okay, I will pretrain a seq2seq model with a large corpus. I assume I would have to pass the saved checkpoint of the pre-trained seq2seq model as --init_checkpoint, but the paper says the pre-trained seq2seq model is used to initialize only the encoder module and the word embedding layer, not the decoder module. Do I need to handle that separately?
PS: As mentioned in the README, you used a TPU v3-64, which ordinary researchers can't get access to. So it would be great if a pre-trained checkpoint were published.
There are multiple stages involved in getting the best performance, as shown in Figure 3; among them, single-doc pretraining is more crucial than multi-doc distant supervision, assuming you have access to a BERT pretraining checkpoint. Based on this, I suggest first training a seq2seq model from the BERT pretraining checkpoint, and then training NHNet from that seq2seq checkpoint.
Unfortunately, the pre-trained checkpoint cannot be published due to company policy. For the same reason, the released dataset contains only URLs, not titles/bodies.
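To illustrate the kind of partial initialization asked about above (encoder and word embeddings only, decoder left randomly initialized), here is a hypothetical sketch using object-based checkpointing. The builder function and the attribute names model.encoder and model.embedding are assumptions for illustration, not the actual NHNet API:

import tensorflow as tf

model = build_nhnet_model()  # hypothetical builder, for illustration only
# Restore only the encoder and the word-embedding layer from the
# pre-trained seq2seq (BERT2BERT) checkpoint; the decoder keeps its
# fresh initialization.
partial = tf.train.Checkpoint(encoder=model.encoder,
                              embedding=model.embedding)
status = partial.restore("/path/to/bert2bert_checkpoint")
# Silence warnings about checkpoint values we deliberately skip
# (the decoder variables are intentionally not restored).
status.expect_partial()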
The LossScaleOptimizer issue comes from the fact that we call the private _decayed_lr method on the optimizer, which is not supported on the LossScaleOptimizer:
return {
    "training_loss": loss,
    "learning_rate": self.optimizer._decayed_lr(var_dtype=tf.float32)
}
I am working on resolving this for TensorFlow 2.4. In the meantime, you can replace the above snippet with:
if isinstance(self.optimizer,
              tf.keras.mixed_precision.experimental.LossScaleOptimizer):
  inner_optimizer = self.optimizer._optimizer
else:
  inner_optimizer = self.optimizer
return {
    "training_loss": loss,
    "learning_rate": inner_optimizer._decayed_lr(var_dtype=tf.float32)
}
Note this snippet relies on accessing the private _optimizer member of LossScaleOptimizer, so it may break in TF 2.4. But the code already relies on the private _decayed_lr method, so it may break in TF 2.4 anyway.
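As a possible follow-up, my understanding is that the non-experimental API planned for TF 2.4 exposes the wrapped optimizer publicly, which would remove the private _optimizer access (unverified against the final 2.4 release):

# TF 2.4+ (assumed): tf.keras.mixed_precision.LossScaleOptimizer has a
# public `inner_optimizer` property instead of the private `_optimizer`.
if isinstance(self.optimizer, tf.keras.mixed_precision.LossScaleOptimizer):
  inner_optimizer = self.optimizer.inner_optimizer
else:
  inner_optimizer = self.optimizer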
_decayed_lr is probably not needed. We can use:
if callable(self.optimizer.learning_rate):
  logs['learning_rate'] = self.optimizer.learning_rate(self.global_step)
else:
  logs['learning_rate'] = self.optimizer.learning_rate
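The callable check is needed because learning_rate can be either a plain float or a LearningRateSchedule object, which is callable. A minimal standalone example of both branches:

import tensorflow as tf

# A LearningRateSchedule maps a step to a learning rate and is callable;
# a plain float is not, hence the two branches above.
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-4, decay_steps=10000, end_learning_rate=0.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
global_step = tf.Variable(0, dtype=tf.int64)
if callable(optimizer.learning_rate):
  lr = optimizer.learning_rate(global_step)
else:
  lr = optimizer.learning_rate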
Thanks for the input. I will try changing the loss computation like this. But I think the dtypes of most of the model layers are explicitly set to float32, so fp16 might not help. I will report my observations once I try it.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/official/nlp/nhnet
2. Describe the bug
I tried to replicate the results of NHNet as shown in the paper, but the maximum accuracy I could reach was only 30%. The only thing I did differently from the steps in the repo was to use a batch_size of 8 instead of 16 for GPU training; that alone should not cause such a vast difference in accuracy, yet it is the only change. I crawled the data for more than 5 days to be on the safe side. I am trying to use mixed precision so that I can fit a bigger batch size, but when I use
tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt,"dynamic")
it gives the following error: AttributeError: 'LossScaleOptimizer' object has no attribute '_hypers_created'
I updated the train_step function based on guidelines here: https://www.tensorflow.org/guide/mixed_precision?hl=en#training_the_model_with_a_custom_training_loop
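For reference, the custom-training-loop pattern from that guide looks roughly like this (a sketch with a generic model and loss function, not the NHNet trainer):

import tensorflow as tf

opt = tf.keras.optimizers.Adam()
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, "dynamic")

@tf.function
def train_step(model, loss_fn, x, y):
  with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))
    # Scale the loss up so fp16 gradients do not underflow...
    scaled_loss = opt.get_scaled_loss(loss)
  scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
  # ...then scale the gradients back down before applying them.
  grads = opt.get_unscaled_gradients(scaled_grads)
  opt.apply_gradients(zip(grads, model.trainable_variables))
  return loss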
3. Steps to reproduce
Update the trainer.py file with this:
4. Expected behavior
Training should start as it did before enabling FP16.
5. Additional context
Error message when I run the above code: AttributeError: 'LossScaleOptimizer' object has no attribute '_hypers_created'
It also prints some warnings like this:
6. System information