Closed: soares-f closed this issue 1 year ago.
@soares-f Thanks! Yes, those are good points. The checkpoint averaging utils and the init-from-checkpoint feature are not implemented yet. We are working on improving this repo.
@xinliupitt
@soares-f, for feature #2, I think we can get a new optimizer using `_create_optimizer()`, to effectively reset the optimizer:

```python
while current_step < flags_obj.train_steps_stage_0:
    train_steps(train_ds_iterator,
                tf.convert_to_tensor(train_steps_per_eval, dtype=tf.int32))

del opt
opt = self._create_optimizer()

while current_step < flags_obj.train_steps_stage_1:
    train_steps(train_ds_iterator,
                tf.convert_to_tensor(train_steps_per_eval, dtype=tf.int32))
```
Hi, sorry for the late response; I was quite busy with the whole WMT shared task. I will try that. As for checkpoint averaging, I'm thinking about adapting the Tensor2Tensor implementation (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py) to fit this one. From what I saw, it is just a matter of not importing the optimizer variables.
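For reference, the core of that Tensor2Tensor script is just an elementwise average of each variable across the selected checkpoints, with optimizer state left out so it gets re-initialized on restore. A minimal NumPy sketch of that logic — checkpoints are represented here as plain `{name: array}` dicts, which is an assumption for illustration; the real script reads variables with TensorFlow's checkpoint reader:

```python
import numpy as np

def average_checkpoints(checkpoints, skip_prefixes=("optimizer", "global_step")):
    """Elementwise-average model variables across checkpoints.

    Variables whose names start with `skip_prefixes` (optimizer slots,
    step counters) are dropped, mirroring "not importing optimizer
    variables" from the comment above.
    """
    names = [n for n in checkpoints[0] if not n.startswith(skip_prefixes)]
    averaged = {}
    for name in names:
        # Stack the same variable from every checkpoint and average.
        stacked = np.stack([ckpt[name] for ckpt in checkpoints])
        averaged[name] = stacked.mean(axis=0)
    return averaged

# Two hypothetical checkpoints of the same model:
ckpt_a = {"dense/kernel": np.array([1.0, 3.0]), "optimizer/m": np.array([9.0])}
ckpt_b = {"dense/kernel": np.array([3.0, 5.0]), "optimizer/m": np.array([1.0])}
avg = average_checkpoints([ckpt_a, ckpt_b])
print(avg)  # only 'dense/kernel' survives, averaged to [2., 4.]
```

In the actual adaptation, the dicts would come from `tf.train.load_checkpoint`, and the averaged values would be written back into a fresh checkpoint.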
Hi @soares-f,
Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information in it may no longer be relevant to the current state of the code base. The TF models official team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, including all the debugging information that could help us investigate. Please follow the release notes to stay up to date with the latest developments in the TF models official space.
This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for the past 7 days.
Prerequisites
Please answer the following question for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer_main.py
2. Describe the feature you request
It would be interesting to have the following features:
1. Checkpoint averaging
2. Incremental training (continuing training on new data, with the optimizer state reset)
3. Additional context
Feature 1 (checkpoint averaging) is implemented in Tensor2Tensor (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py) but I'm not exactly sure how to adapt that to the transformer code in this repository.
Feature 2: I'm not exactly sure how to do incremental training with this model. Theoretically, I'd only need the new data, but I'm aware that the optimizer parameters should be "reset" to an initial state. However, by inspecting the code I cannot see straight away how to perform that. For instance, how could I "manually" set the step to zero?
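One way to "reset" both the step counter and the slot variables (e.g. Adam moments) is simply to rebuild the optimizer, as the comment above suggests; in TF2, Keras optimizers also expose the step as the `iterations` variable, which can be assigned directly. A dependency-free toy sketch of the rebuild pattern — `ToyOptimizer` and `create_optimizer` are illustrative stand-ins, not this repo's API:

```python
class ToyOptimizer:
    """Stand-in for a Keras optimizer: accumulates per-step state."""
    def __init__(self, lr=0.1):
        self.lr = lr
        self.iterations = 0   # step counter, like optimizer.iterations in TF2
        self.slots = {}       # e.g. Adam moments, keyed by variable name

    def apply_gradients(self, grads):
        self.iterations += 1
        for name, g in grads.items():
            m = self.slots.get(name, 0.0)
            self.slots[name] = 0.9 * m + 0.1 * g  # toy momentum update

def create_optimizer():
    return ToyOptimizer()

opt = create_optimizer()
for _ in range(5):
    opt.apply_gradients({"w": 1.0})
print(opt.iterations)  # 5 steps of accumulated state

# "Reset" by discarding the old optimizer and building a fresh one,
# mirroring `del opt; opt = self._create_optimizer()` from the thread.
opt = create_optimizer()
print(opt.iterations, opt.slots)  # back to step 0 with empty slots
```

With a real Keras optimizer, `opt.iterations.assign(0)` would zero the step without touching the slots, while recreating the optimizer resets everything.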
4. Are you willing to contribute it? (Yes or No)
Yes