zredlined opened this issue 4 years ago
Quick update: we have initial feedback from the TensorFlow team in the issue above that the new codepath was disabled due to an internal issue at Google.
This should affect any code running an LSTM with TF Privacy on TF 2.4+. Are there any options with TF Privacy to get past this slowdown, or to re-enable the codepath optionally?
@zredlined can you add me to the thread with the TF team or point me to the GitHub issue?
@aterzis-google - any feedback or thoughts on our request in https://github.com/tensorflow/tensorflow/issues/44917 to add `_use_new_code()` as a user-selectable parameter? This issue should affect anyone using an RNN/LSTM/GRU with TensorFlow Privacy. Thanks!
Seems there's agreement in https://github.com/tensorflow/tensorflow/issues/44917 to make it a user-selectable parameter.
Hey TF Privacy team, we noticed a pretty significant slowdown working with TensorFlow Privacy LSTM and GRU models on TF 2.4. It appears to happen only when using the TF Privacy optimizers. Here is an example Gist where, depending on the version of TF installed, training goes from 15 sec/epoch to 2+ min/epoch with the latest TF release candidate (tensorflow==2.4.0rc1).
From some testing, it looks like the slowdown was introduced between these two tf-nightly builds.
Environment: GCP, Tesla V100, 16 GB RAM, Ubuntu, 8 vCPUs
Recreate the issue with this Gist https://gist.github.com/zredlined/72305ab04670197869e470b232d22ed4
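For context, here is a minimal sketch of the kind of setup the Gist exercises. This is an illustrative reconstruction, not the Gist itself; the dataset, layer sizes, and DP hyperparameters below are placeholders:

```python
import numpy as np
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

# Toy sequence-classification data; all shapes and hyperparameters are placeholders.
batch_size = 32
x = np.random.randint(0, 1000, size=(1024, 20))
y = np.random.randint(0, 2, size=(1024,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 64),
    tf.keras.layers.LSTM(64),  # the layer hit by the disabled codepath
    tf.keras.layers.Dense(2),
])

# The DP optimizer; num_microbatches must evenly divide the batch size.
optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,
    noise_multiplier=1.1,
    num_microbatches=batch_size,
    learning_rate=0.15,
)

# TF Privacy needs a per-example (unreduced) loss so it can clip per-example gradients.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss)
model.fit(x, y, batch_size=batch_size, epochs=1)
```

Swapping `DPKerasSGDOptimizer` for a stock `tf.keras.optimizers.SGD` in the same script is what isolates the slowdown to the TF Privacy optimizer.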
I think this TensorFlow commit is the culprit: https://github.com/tensorflow/tensorflow/commit/73b709743a2eba2c912351e8d3334ef25e174c4b. Changing `_use_new_code()` back to return True speeds the code back up. The only reference I can find is in the issue above, to what looks like an internal Google issue. Any help would be hugely appreciated; on most datasets we have tested, the slowdowns are 10-20x. Thanks!
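For anyone who wants to experiment in the meantime, one possible stopgap is to flip that gate back at runtime by monkey-patching the private function. This is an untested sketch against TF 2.4's internal Keras layout (the module path and function name are taken from the commit above); it is unsupported and may break between releases:

```python
import tensorflow as tf  # ensure TF is loaded first

# Unsupported workaround: re-enable the new RNN codepath that the commit above
# disabled, by overriding the private _use_new_code() gate. The module path
# assumes TF 2.4's internal Keras layout. Apply before building/training the model.
from tensorflow.python.keras.layers import recurrent_v2

recurrent_v2._use_new_code = lambda: True
```

If epoch times drop back to the pre-2.4 numbers with this patch applied, that would further confirm `_use_new_code()` as the gate behind the regression.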