AlexFuster opened this issue 3 years ago (status: Open)
@AlexFuster Thanks for creating this issue. This looks more related to keras-team/keras, so I have moved it to the keras-team/keras repo for resolution. Thanks!
Well, the Keras team doesn't seem to agree with that.
@AlexFuster This looks more related to TF core, so this repo is the right place for this issue. Thanks!
I noticed an interesting commit flipping recurrent_v2._use_new_code() to False:
https://github.com/tensorflow/tensorflow/commit/73b709743a2eba2c912351e8d3334ef25e174c4b
This could explain why the performance degraded again.
I checked that if I monkey-patch a revert of this commit, the speed improves drastically:
from keras.layers import recurrent_v2
# Force Keras back onto the newer (faster) RNN code path
recurrent_v2._use_new_code = lambda: True
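For context, _use_new_code is a private Keras helper (note the leading underscore), so this monkey patch is a fragile workaround rather than a supported switch and may change or disappear between releases.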
The change is by @yhliang2018 -- any background on why we reverted to the old, slower code path?
Hi,
Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information in it may no longer be relevant to the current state of the code base.
The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.
Please follow the release notes to stay up to date with the latest developments in the TensorFlow space.
System information
Describe the current behavior Performance is very low in graph mode when using a persistent tf.GradientTape or when creating multiple GradientTape objects in one with block. This only happens when the model includes an LSTM or GRU layer.
Standalone code to reproduce the issue
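The original reproduction code is not preserved here; the following is a minimal sketch, consistent with the description in this issue, of the three train_step variants being compared (a persistent tape, two tapes in one with block, and a single non-persistent tape). The model shape, data, optimizer, and loss are illustrative assumptions, not the reporter's exact setup.

import time
import tensorflow as tf

# Small LSTM model; the slowdown is reported only when an LSTM/GRU layer is present.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(20, 8)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()
x = tf.random.normal((32, 20, 8))
y = tf.random.normal((32, 1))

@tf.function
def train_step_0(x, y):
    # Variant 0: persistent tape -- reportedly slow with LSTM/GRU in graph mode.
    with tf.GradientTape(persistent=True) as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def train_step_1(x, y):
    # Variant 1: two tapes opened in one `with` block -- also reportedly slow.
    with tf.GradientTape() as tape_a, tf.GradientTape() as tape_b:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape_a.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def train_step_2(x, y):
    # Variant 2: single non-persistent tape -- reportedly much faster.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Time 100 training steps for each variant (first call includes tracing).
for name, step_fn in [("train_step_0", train_step_0),
                      ("train_step_1", train_step_1),
                      ("train_step_2", train_step_2)]:
    start = time.time()
    for _ in range(100):
        step_fn(x, y)
    print(name, time.time() - start)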
Other info Both train_step_0 and train_step_1 show the error, while train_step_2 doesn't. On my GPU, the first two approaches take around 17 s for 100 training steps, while the third takes 4.3 s (roughly a 4x slowdown). Furthermore, the performance drop is only reproducible when using GRU/LSTM layers in graph mode: if we remove the tf.function decorator from the train_step functions, or if we replace the LSTM with a Dense layer, all three examples take the same time and none of them prints any error. As additional info, this problem occurs both on CPU and on GPU.
By the way, this issue is an updated version of #35928, which addressed a very similar problem.