`dataset.reduce` error in Multi-GPU simulation of `optimization`

hongliny commented 3 years ago

Hi there,

I am trying to launch multi-GPU experiments based on research/optimization, but I keep getting errors involving `dataset.reduce`, as shown below:

ValueError: Detected dataset reduce op in multi-GPU TFF simulation: `use_experimental_simulation_loop=True` for `tff.learning`; or use `for ... in iter(dataset)` for your own dataset iteration.Reduce op will be functional after b/159180073.

I tried replacing this line and this line with `for batch in iter(dataset)`, but the issue persists. I couldn't find any other potential usage of `dataset.reduce`.

Here is the command I used to reproduce this issue:

bazel run main:federated_trainer -- --task=emnist_cr --total_rounds=100 \
--client_optimizer=sgd --client_learning_rate=0.1 --client_batch_size=20 \
--server_optimizer=sgd --server_learning_rate=1.0 --clients_per_round=10 \
--client_epochs_per_round=1 --experiment_name=emnist_fedavg_experiment \

Any help will be greatly appreciated.

hongliny commented 3 years ago

I found the solution: the call tff.learning.build_federated_evaluation(tff_model_fn) should also be replaced with tff.learning.build_federated_evaluation(tff_model_fn, use_experimental_simulation_loop=True).

Closing this issue.

zcharles8 commented 3 years ago

Re-opening for posterity while issues with #33 are being investigated.

nightldj commented 3 years ago

We seem to have a good understanding of what is happening. In short, in a multi-GPU environment, use `use_experimental_simulation_loop=True` for tff.learning functions, and `for ... in iter(dataset)` for your own customized training loops. Note that dropout and other layers with internal randomness may sometimes give unexpected results and should be used with care. Closing this issue for now.
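
For anyone landing here later, here is a minimal sketch of the two iteration styles mentioned above, assuming TensorFlow 2.x. The function names and the toy dataset are made up for illustration and are not taken from research/optimization. The first function uses `dataset.reduce`, which is the pattern the error message complains about; the second expresses the same computation with `for ... in iter(dataset)`.

```python
import tensorflow as tf


def count_examples_with_reduce(dataset):
    # Hypothetical helper: counts examples via dataset.reduce. This is the
    # pattern the ValueError above reports in multi-GPU TFF simulations.
    return dataset.reduce(
        tf.constant(0, dtype=tf.int64),
        lambda count, batch: count + tf.shape(batch, out_type=tf.int64)[0],
    )


@tf.function
def count_examples_with_loop(dataset):
    # Hypothetical helper: the same count, written as an explicit loop over
    # iter(dataset), the form suggested by the error message.
    count = tf.constant(0, dtype=tf.int64)
    for batch in iter(dataset):
        count += tf.shape(batch, out_type=tf.int64)[0]
    return count


# Toy dataset (an assumption for illustration): ten integers in batches of 4.
toy_dataset = tf.data.Dataset.range(10).batch(4)

reduce_count = int(count_examples_with_reduce(toy_dataset))
loop_count = int(count_examples_with_loop(toy_dataset))
print(reduce_count, loop_count)
```

Both functions should return the same value on a single device; the loop form is simply the one to prefer in your own `tf.function` code when running multi-GPU simulations.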

google-research / federated

`dataset.reduce` error in Multi-GPU simulation of `optimization` #32