tensorflow / models

Models and examples built with TensorFlow

Transformer average checkpoints and incremental training / domain adaptation #8784

Closed soares-f closed 1 year ago

soares-f commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer_main.py

2. Describe the feature you request

It would be interesting to have the following features:

  1. Checkpoint averaging
  2. Incremental training for domain adaptation

3. Additional context

Feature 1 (checkpoint averaging) is implemented in Tensor2Tensor (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py), but I'm not exactly sure how to adapt it to the transformer code in this repository.
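For reference, the core of that T2T script is just loading each checkpoint and averaging the variable values. A rough sketch of the same idea in TF2 terms (the function name and the optimizer-skipping heuristic are illustrative, not this repo's API):

```python
import numpy as np
import tensorflow as tf

def average_checkpoints(checkpoint_paths):
  """Rough sketch: average variable values across several checkpoints."""
  readers = [tf.train.load_checkpoint(p) for p in checkpoint_paths]
  averaged = {}
  for name in readers[0].get_variable_to_shape_map():
    values = [reader.get_tensor(name) for reader in readers]
    # Skip non-float entries (step counters, serialized object graphs)
    # and, heuristically, anything belonging to the optimizer.
    if not np.issubdtype(np.asarray(values[0]).dtype, np.floating):
      continue
    if "optimizer" in name.lower():
      continue
    averaged[name] = np.mean(values, axis=0)
  return averaged
```

The averaged values would then need to be assigned back into a freshly built model's variables and re-saved, which is where matching this repository's variable names comes in.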

Feature 2: I'm not exactly sure how to do incremental training with this model. In principle, I'd only need the new data, but I'm aware that the optimizer parameters should be "reset" to an initial state. However, from inspecting the code, it is not immediately clear how to do that. For instance, how could I "manually" set the step to zero?
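To make the step-reset question concrete: assuming the model is trained with a standard tf.keras optimizer, the step counter is the optimizer's `iterations` variable, so a minimal sketch of "setting the step to zero" could be:

```python
import tensorflow as tf

# Minimal sketch, assuming a standard tf.keras optimizer: the training
# step lives in `optimizer.iterations`, so zeroing it restarts any
# step-based schedules such as learning-rate warmup.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
optimizer.iterations.assign(0)
```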

4. Are you willing to contribute it? (Yes or No)

Yes

saberkun commented 4 years ago

@soares-f Thanks! Yes, these are good points. The checkpoint-averaging utils and the init-from-checkpoint feature are not implemented yet. We are working on improving this repo. @xinliupitt

lehougoogle commented 4 years ago

@soares-f, for feature 2, I think we can create a new optimizer with _create_optimizer() to effectively reset the optimizer:

```python
while current_step < flags_obj.train_steps_stage_0:
  train_steps(train_ds_iterator,
              tf.convert_to_tensor(train_steps_per_eval, dtype=tf.int32))

# Recreate the optimizer to reset its state (step count and slot variables).
del opt
opt = self._create_optimizer()

while current_step < flags_obj.train_steps_stage_1:
  train_steps(train_ds_iterator,
              tf.convert_to_tensor(train_steps_per_eval, dtype=tf.int32))
```
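An in-place alternative (a sketch, assuming `opt` is a tf.keras optimizer whose variables have already been created by training) is to zero the optimizer's variables instead of recreating the object:

```python
# opt.variables() includes the step counter `iterations` as well as slot
# variables such as Adam's moments, so this also resets the step to zero.
for var in opt.variables():
  var.assign(tf.zeros_like(var))
```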

soares-f commented 4 years ago


Hi, sorry for the late response; I was quite busy with the WMT shared task. I will try that. As for checkpoint averaging, I'm thinking about adapting the Tensor2Tensor implementation (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py) to fit this repository. From what I saw, it is just a matter of not importing the optimizer variables.
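A sketch of what "not importing optimizer variables" could look like with TF2 object-based checkpoints (assuming `model` is the built Keras transformer and the checkpoint was written with a matching `model=` key):

```python
import tensorflow as tf

# By constructing the Checkpoint with the model alone (no optimizer),
# optimizer slots stored in the file are simply ignored on restore.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore("path/to/averaged-ckpt").expect_partial()  # path is illustrative
```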

laxmareddyp commented 1 year ago

Hi @soares-f,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information in it may no longer be relevant to the current state of the code base. The TF models official team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing it, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate. Please follow the release notes to stay up to date with the latest developments in the TF models official space.

github-actions[bot] commented 1 year ago

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed due to lack of activity after being marked stale for the past 7 days.