google-research / federated

A collection of Google research projects related to Federated Learning and Federated Analytics.
Apache License 2.0

fix dataset.reduce issue for optimization directory #33

Closed hongliny closed 3 years ago

google-cla[bot] commented 3 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



hongliny commented 3 years ago

@googlebot I signed it!

zcharles8 commented 3 years ago

While this PR looks good to me, we noticed some strange behavior of Keras models with dropout when we changed to `for batch in iter(dataset)`. This is purely a TF issue (involving unexpected interactions between TF dataset ops and Keras layers).

We are currently investigating this, and will push this PR when we have figured out what's going on.
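
For reference, here is a minimal sketch (not the actual code changed in this PR) contrasting the two client-loop styles discussed above. The model, learning rate, and the batch keys `'x'`/`'y'` are placeholders for whatever the optimization tasks actually use:

```python
import tensorflow as tf


@tf.function
def train_with_reduce(model, dataset, learning_rate):
  """Previous style: fold over the client dataset with dataset.reduce."""

  def step(num_examples, batch):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(
          tf.keras.losses.sparse_categorical_crossentropy(
              batch['y'], model(batch['x'], training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    # Plain SGD via assign_sub keeps the sketch free of lazily created
    # optimizer variables, which cannot be created inside dataset.reduce.
    for g, v in zip(grads, model.trainable_variables):
      v.assign_sub(learning_rate * g)
    return num_examples + tf.shape(batch['x'])[0]

  return dataset.reduce(tf.constant(0), step)


@tf.function
def train_with_iter(model, dataset, learning_rate):
  """This PR's style: iterate explicitly with `for batch in iter(dataset)`."""
  num_examples = tf.constant(0)
  for batch in iter(dataset):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(
          tf.keras.losses.sparse_categorical_crossentropy(
              batch['y'], model(batch['x'], training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    for g, v in zip(grads, model.trainable_variables):
      v.assign_sub(learning_rate * g)
    num_examples += tf.shape(batch['x'])[0]
  return num_examples


# Tiny, hypothetical usage with random data, just to make the sketch runnable.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(3, activation='softmax', input_shape=(4,))])
dataset = tf.data.Dataset.from_tensor_slices({
    'x': tf.random.normal([16, 4]),
    'y': tf.random.uniform([16], maxval=3, dtype=tf.int32),
}).batch(4)
print(train_with_reduce(model, dataset, learning_rate=0.1).numpy())
print(train_with_iter(model, dataset, learning_rate=0.1).numpy())
```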

hbmcmahan commented 3 years ago

Quick update on this: we are still following up with the TensorFlow team to see if we can identify the underlying issue with Keras.

amitport commented 3 years ago

@hbmcmahan @zcharles8 Hi, can you expand on the issue? Should we currently use the experimental loop with nightly versions of TF and TFF? Or does this issue mean that the results will be inaccurate?

zcharles8 commented 3 years ago

@amitport Just to clarify: you mention `use_experimental_loop`, which I take to mean you are using `tff.learning`? I just want to understand the scope, since that doesn't sound like it's part of the PR in question here.

amitport commented 3 years ago

I mean both. If I'm using this PR to run with multiple GPUs, or using `tff.learning`'s `use_experimental_loop` (which seems to do the same thing internally), am I getting inaccurate results? If `tff.learning` is OK while this PR isn't, I wonder what the difference is between these implementations (of the dataset iteration). Thank you!

zcharles8 commented 3 years ago

@amitport @hongliny Our current recommendation is to avoid using `iter(dataset)` or `use_experimental_loop` unless you are doing multi-GPU experiments, in which case it is necessary.

In most cases, we have not seen any difference in training behavior between the two. The only cases where I have seen differences involve models that use dropout layers, and I don't have a good explanation for what's happening there. Training still appears to proceed normally with either `dataset` or `iter(dataset)`, but the model updates differ slightly. Unfortunately I don't have any more details yet; this is a pretty mysterious thing.
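
In case it helps, here is a hedged sketch of the `tff.learning` path being discussed. In the TFF releases I have used, the flag is spelled `use_experimental_simulation_loop` on `tff.learning.build_federated_averaging_process`; the exact name and availability may differ across versions, and the toy model and input spec below are placeholders, not a definitive recipe:

```python
import collections

import tensorflow as tf
import tensorflow_federated as tff


def create_keras_model():
  return tf.keras.Sequential(
      [tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,))])


# Illustrative input spec; replace with the spec of your client datasets.
input_spec = collections.OrderedDict(
    x=tf.TensorSpec(shape=(None, 784), dtype=tf.float32),
    y=tf.TensorSpec(shape=(None, 1), dtype=tf.int32),
)


def model_fn():
  return tff.learning.from_keras_model(
      create_keras_model(),
      input_spec=input_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
  )


iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.1),
    # Per the recommendation above: leave this at its default (False) unless
    # you actually need the multi-GPU simulation path.
    use_experimental_simulation_loop=True,
)
state = iterative_process.initialize()
```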

zcharles8 commented 3 years ago

One last clarification: this weird behavior of dropout when using `iter(dataset)` is a pure TF issue and occurs independently of TFF, so I'm not sure we have much ability to fix it.
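
Below is a hedged, TFF-free sketch of how one might probe this: build the same dropout model twice, train once with `dataset.reduce` and once with `for batch in iter(dataset)` inside a `tf.function`, and compare the resulting weights. The model, data, seeds, and learning rate are illustrative placeholders, not the actual reproduction:

```python
import numpy as np
import tensorflow as tf


def make_model():
  # Resetting the global seed makes both runs start from identical weights.
  tf.random.set_seed(0)
  return tf.keras.Sequential([
      tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(1),
  ])


def make_dataset():
  rng = np.random.RandomState(0)
  x = rng.randn(32, 4).astype(np.float32)
  y = rng.randn(32, 1).astype(np.float32)
  return tf.data.Dataset.from_tensor_slices((x, y)).batch(8)


def train(model, dataset, use_iter, learning_rate=0.1):

  def step(batch):
    x, y = batch
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    for g, v in zip(grads, model.trainable_variables):
      v.assign_sub(learning_rate * g)
    return loss

  @tf.function
  def run():
    if use_iter:
      for batch in iter(dataset):
        step(batch)
    else:
      dataset.reduce(tf.constant(0.0), lambda _, batch: step(batch))

  run()
  return [v.numpy() for v in model.trainable_variables]


weights_reduce = train(make_model(), make_dataset(), use_iter=False)
weights_iter = train(make_model(), make_dataset(), use_iter=True)
# No TFF is involved in this script, so any differences observed here are
# purely TF behavior (including ordinary dropout randomness).
print([float(np.max(np.abs(a - b)))
       for a, b in zip(weights_reduce, weights_iter)])
```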

amitport commented 3 years ago

@zcharles8 OK, thanks! To be on the safe side, where reproducibility matters, I'll avoid multi-GPU for a while (it's expensive anyway ;) ).