Closed: hongliny closed this pull request 3 years ago.
@googlebot I signed it!
While this PR looks good to me, we noticed some strange behavior of Keras models with dropout when we changed to `for batch in iter(dataset)`. This is purely a TF issue (involving unexpected interactions between TF dataset ops and Keras layers). We are currently investigating this, and will move this PR forward when we have figured out what's going on.
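For readers skimming this thread, the change in question is only the iteration style inside a `tf.function`. A minimal sketch of the two styles (the toy dataset and sums are illustrative, not code from this PR):

```python
import tensorflow as tf

# Toy dataset; the real PR concerns TFF training loops, this is only illustrative.
dataset = tf.data.Dataset.range(4).batch(2)

@tf.function
def sum_direct():
  total = tf.constant(0, dtype=tf.int64)
  for batch in dataset:  # old style: iterate the dataset directly
    total += tf.reduce_sum(batch)
  return total

@tf.function
def sum_via_iter():
  total = tf.constant(0, dtype=tf.int64)
  for batch in iter(dataset):  # new style: explicit iterator, as in this PR
    total += tf.reduce_sum(batch)
  return total

# Both produce 0 + 1 + 2 + 3 = 6; the concern described above is that with
# dropout layers the two styles can yield slightly different model updates.
print(sum_direct().numpy(), sum_via_iter().numpy())
```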
Quick update on this: we are still following up with the TensorFlow team to see if we can identify the underlying issue with Keras.
@hbmcmahan @zcharles8 Hi, can you expand on the issue? Should we currently use the experimental loop with nightly versions of TF and TFF? Or does this issue mean that the results will be inaccurate?
@amitport Just to clarify: you are referring to `use_experimental_loop`, which I take it means you are using `tff.learning`? Just want to understand the scope here, since it doesn't sound like this is part of the PR in question.
I mean both. If I'm using this PR for running with multiple GPUs, or using `tff.learning` with `use_experimental_loop` (which seems to be doing the same thing internally), am I getting inaccurate results? If `tff.learning` is OK while this PR isn't, I wonder what the difference is between these implementations (of the dataset iteration). Thank you!
@amitport @hongliny Our current recommendation is to avoid using `iter(dataset)` or `use_experimental_loop` unless you are doing multi-GPU experiments, in which case it is necessary.
In most cases, we have not seen any difference in training behavior between the two. The only cases where I have seen one are in models that use dropout layers, and I don't have any idea what's happening there. It seems like training still occurs whether you use `dataset` or `iter(dataset)`, but the model updates differ slightly. Unfortunately I don't have any more details yet; this is a pretty mysterious thing.
One last clarification: this weird behavior of dropout when using `iter(dataset)` is a pure TF issue, and occurs independently of TFF, so I'm not sure we have much ability to fix it.
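For anyone who wants to probe their own setup, a discrepancy like the one described above could be checked with something along these lines. This is a rough sketch under assumptions: the small dropout model, data, and seeding are all illustrative, and whether a nonzero difference appears will depend on the TF version, since this is the TF-level behavior under investigation rather than anything in TFF:

```python
import numpy as np
import tensorflow as tf

def make_model():
  # Re-seed so both runs start from identical weights and dropout streams.
  tf.random.set_seed(0)
  return tf.keras.Sequential([
      tf.keras.layers.Dense(8, activation='relu'),
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(1),
  ])

rng = np.random.RandomState(0)
dataset = tf.data.Dataset.from_tensor_slices(
    (rng.randn(16, 4).astype(np.float32),
     rng.randn(16, 1).astype(np.float32))).batch(4)

def train_one_epoch(model, use_iter):
  opt = tf.keras.optimizers.SGD(learning_rate=0.1)

  @tf.function
  def epoch():
    # The only difference between the two runs is dataset vs. iter(dataset);
    # use_iter is a Python bool, so the choice is fixed at trace time.
    for x, y in (iter(dataset) if use_iter else dataset):
      with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
      grads = tape.gradient(loss, model.trainable_variables)
      opt.apply_gradients(zip(grads, model.trainable_variables))

  epoch()
  return [v.numpy() for v in model.trainable_variables]

weights_direct = train_one_epoch(make_model(), use_iter=False)
weights_iter = train_one_epoch(make_model(), use_iter=True)
for a, b in zip(weights_direct, weights_iter):
  print(np.abs(a - b).max())  # nonzero values show the updates diverging
```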
@zcharles8 OK thanks! To be on the safe side, where reproducibility matters, I would not be using multi-GPU for a while (it's expensive anyway ;) ).
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
:memo: Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here with
@googlebot I signed it!
and we'll verify it.