google-research / federated

A collection of Google research projects related to Federated Learning and Federated Analytics.
Apache License 2.0

fix dataset.reduce issue for optimization directory #33

Closed hongliny closed 3 years ago

google-cla[bot] commented 3 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



hongliny commented 3 years ago

@googlebot I signed it!

zcharles8 commented 3 years ago

While this PR looks good to me, we noticed some strange behavior of Keras models with dropout when we changed to `for batch in iter(dataset)`. This is purely a TF issue (involving unexpected interactions between TF dataset ops and Keras layers).

We are currently investigating this, and will push this PR when we have figured out what's going on.
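
For reference, here is a minimal sketch (not the actual code changed in this PR) contrasting the two client-loop styles discussed above. The model, learning rate, and the batch keys `'x'`/`'y'` are placeholders for whatever the optimization tasks actually use:

```python
import tensorflow as tf


@tf.function
def train_with_reduce(model, dataset, learning_rate):
  """Previous style: fold over the client dataset with dataset.reduce."""

  def step(num_examples, batch):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(
          tf.keras.losses.sparse_categorical_crossentropy(
              batch['y'], model(batch['x'], training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    # Plain SGD via assign_sub keeps the sketch free of lazily created
    # optimizer variables, which cannot be created inside dataset.reduce.
    for g, v in zip(grads, model.trainable_variables):
      v.assign_sub(learning_rate * g)
    return num_examples + tf.shape(batch['x'])[0]

  return dataset.reduce(tf.constant(0), step)


@tf.function
def train_with_iter(model, dataset, learning_rate):
  """This PR's style: iterate explicitly with `for batch in iter(dataset)`."""
  num_examples = tf.constant(0)
  for batch in iter(dataset):
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(
          tf.keras.losses.sparse_categorical_crossentropy(
              batch['y'], model(batch['x'], training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    for g, v in zip(grads, model.trainable_variables):
      v.assign_sub(learning_rate * g)
    num_examples += tf.shape(batch['x'])[0]
  return num_examples


# Tiny, hypothetical usage with random data, just to make the sketch runnable.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(3, activation='softmax', input_shape=(4,))])
dataset = tf.data.Dataset.from_tensor_slices({
    'x': tf.random.normal([16, 4]),
    'y': tf.random.uniform([16], maxval=3, dtype=tf.int32),
}).batch(4)
print(train_with_reduce(model, dataset, learning_rate=0.1).numpy())
print(train_with_iter(model, dataset, learning_rate=0.1).numpy())
```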

hbmcmahan commented 3 years ago

Quick update on this: we are still following up with the TensorFlow team to see if we can identify the underlying issue with Keras.

amitport commented 3 years ago

@hbmcmahan @zcharles8 Hi, can you expand on the issue? Should we currently use the experimental loop with nightly versions of TF and TFF? Or does this issue mean that the results will be inaccurate?

zcharles8 commented 3 years ago

@amitport Just to clarify: you mention `use_experimental_loop`, which I take to mean you are using `tff.learning`? I just want to understand the scope, since that doesn't sound like it's part of the PR in question here.

amitport commented 3 years ago

I mean both. If I'm using this PR to run with multiple GPUs, or using `tff.learning`'s `use_experimental_loop` (which seems to do the same thing internally), am I getting inaccurate results? If `tff.learning` is OK while this PR isn't, I wonder what the difference is between these implementations (of the dataset iteration). Thank you!

zcharles8 commented 3 years ago

@amitport @hongliny Our current recommendation is to avoid using `iter(dataset)` or `use_experimental_loop` unless you are doing multi-GPU experiments, in which case it is necessary.

In most cases, we have not seen any difference in training behavior between the two. The only cases where I have seen differences involve models that use dropout layers, and I don't have a good explanation for what's happening there. Training still appears to proceed normally with either `dataset` or `iter(dataset)`, but the model updates differ slightly. Unfortunately I don't have any more details yet; this is a pretty mysterious thing.
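
In case it helps, here is a hedged sketch of the `tff.learning` path being discussed. In the TFF releases I have used, the flag is spelled `use_experimental_simulation_loop` on `tff.learning.build_federated_averaging_process`; the exact name and availability may differ across versions, and the toy model and input spec below are placeholders, not a definitive recipe:

```python
import collections

import tensorflow as tf
import tensorflow_federated as tff


def create_keras_model():
  return tf.keras.Sequential(
      [tf.keras.layers.Dense(10, activation='softmax', input_shape=(784,))])


# Illustrative input spec; replace with the spec of your client datasets.
input_spec = collections.OrderedDict(
    x=tf.TensorSpec(shape=(None, 784), dtype=tf.float32),
    y=tf.TensorSpec(shape=(None, 1), dtype=tf.int32),
)


def model_fn():
  return tff.learning.from_keras_model(
      create_keras_model(),
      input_spec=input_spec,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
  )


iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.1),
    # Per the recommendation above: leave this at its default (False) unless
    # you actually need the multi-GPU simulation path.
    use_experimental_simulation_loop=True,
)
state = iterative_process.initialize()
```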

zcharles8 commented 3 years ago

One last clarification: this weird behavior of dropout when using `iter(dataset)` is a pure TF issue and occurs independently of TFF, so I'm not sure we have much ability to fix it.
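
Below is a hedged, TFF-free sketch of how one might probe this: build the same dropout model twice, train once with `dataset.reduce` and once with `for batch in iter(dataset)` inside a `tf.function`, and compare the resulting weights. The model, data, seeds, and learning rate are illustrative placeholders, not the actual reproduction:

```python
import numpy as np
import tensorflow as tf


def make_model():
  # Resetting the global seed makes both runs start from identical weights.
  tf.random.set_seed(0)
  return tf.keras.Sequential([
      tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(1),
  ])


def make_dataset():
  rng = np.random.RandomState(0)
  x = rng.randn(32, 4).astype(np.float32)
  y = rng.randn(32, 1).astype(np.float32)
  return tf.data.Dataset.from_tensor_slices((x, y)).batch(8)


def train(model, dataset, use_iter, learning_rate=0.1):

  def step(batch):
    x, y = batch
    with tf.GradientTape() as tape:
      loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    for g, v in zip(grads, model.trainable_variables):
      v.assign_sub(learning_rate * g)
    return loss

  @tf.function
  def run():
    if use_iter:
      for batch in iter(dataset):
        step(batch)
    else:
      dataset.reduce(tf.constant(0.0), lambda _, batch: step(batch))

  run()
  return [v.numpy() for v in model.trainable_variables]


weights_reduce = train(make_model(), make_dataset(), use_iter=False)
weights_iter = train(make_model(), make_dataset(), use_iter=True)
# No TFF is involved in this script, so any differences observed here are
# purely TF behavior (including ordinary dropout randomness).
print([float(np.max(np.abs(a - b)))
       for a, b in zip(weights_reduce, weights_iter)])
```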

amitport commented 3 years ago

@zcharles8 OK, thanks! To be on the safe side, where reproducibility matters, I'll avoid multi-GPU for a while (it's expensive anyway ;) ).