google-research / federated

A collection of Google research projects related to Federated Learning and Federated Analytics.
Apache License 2.0

[distributed_dp] Including package versions into the requirements file #57

Open fraboeni opened 2 years ago

fraboeni commented 2 years ago

Hi everyone,

First of all, thank you very much for providing the very nice distributed_dp package.

I was trying to get it to work, and installed the packages referenced in https://github.com/google-research/federated/blob/master/distributed_dp/requirements.txt. Unfortunately, even though I installed the nightly build versions of all the packages as indicated in the README, there seem to be compatibility issues.

I've tried a couple of different combinations of versions for tf, tf-federated, tf-privacy, and tf-estimator, but the code did not run with any of them.

My current setup is:

...
python                    3.9.7                h12debd9_1
keras-nightly             2.9.0.dev2022030808          pypi_0    pypi
tb-nightly                2.9.0a20220307           pypi_0    pypi
tensorboard               2.8.0                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.6.0                      py_0
tensorflow-datasets       4.5.2                    pypi_0    pypi
tensorflow-federated-nightly 0.19.0.dev20220218          pypi_0    pypi
tensorflow-io-gcs-filesystem 0.24.0                   pypi_0    pypi
tensorflow-metadata       1.7.0                    pypi_0    pypi
tensorflow-model-optimization 0.7.1                    pypi_0    pypi
tensorflow-privacy        0.7.3                    pypi_0    pypi
tensorflow-probability    0.15.0                   pypi_0    pypi
tf-estimator-nightly      2.9.0.dev2022030809          pypi_0    pypi
tf-nightly                2.9.0.dev20220308          pypi_0    pypi
... 

In this setup, I get the following error:

Traceback (most recent call last):
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_run.py", line 28, in <module>
    from distributed_dp import fl_utils
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_utils.py", line 22, in <module>
    from distributed_dp import accounting_utils
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/accounting_utils.py", line 21, in <module>
    import tensorflow_privacy as tfp
  File "/home/fraboeni/.conda/envs/tf-federated/lib/python3.9/site-packages/tensorflow_privacy/__init__.py", line 30, in <module>
    from tensorflow_privacy import v1
  File "/home/fraboeni/.conda/envs/tf-federated/lib/python3.9/site-packages/tensorflow_privacy/v1/__init__.py", line 32, in <module>
    from tensorflow_privacy.privacy.estimators.v1.dnn import DNNClassifier as DNNClassifierV1
  File "/home/fraboeni/.conda/envs/tf-federated/lib/python3.9/site-packages/tensorflow_privacy/privacy/estimators/v1/dnn.py", line 19, in <module>
    from tensorflow_privacy.privacy.estimators.v1 import head as head_lib
  File "/home/fraboeni/.conda/envs/tf-federated/lib/python3.9/site-packages/tensorflow_privacy/privacy/estimators/v1/head.py", line 22, in <module>
    from tensorflow.python.ops import lookup_ops  # pylint: disable=g-direct-tensorflow-import
ImportError: cannot import name 'lookup_ops' from 'tensorflow.python.ops' (unknown location)

when running `bazel run :fl_run`.

My question now is the following: could you share version numbers in your requirements.txt file for which the code runs successfully?

kenziyuliu commented 2 years ago

Hi @fraboeni,

Thanks for your interest! I just tried cloning the repo locally and setting up a fresh conda environment, and I was able to get it running with the following commands:

conda create -n tff python=3.9
conda activate tff
pip install -r requirements.txt   # inside `distributed_dp/`
pip install tensorflow-addons
bazel run :fl_run  # the example command for EMNIST

The specific versions of the relevant packages are:

...
python                    3.9.7                h88f2d9e_1
tensorboard               2.8.0                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tensorflow                2.8.0                    pypi_0    pypi
tensorflow-addons         0.16.1                   pypi_0    pypi
tensorflow-datasets       4.5.2                    pypi_0    pypi
tensorflow-estimator      2.8.0                    pypi_0    pypi
tensorflow-federated      0.20.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.24.0                   pypi_0    pypi
tensorflow-metadata       1.7.0                    pypi_0    pypi
tensorflow-model-optimization 0.7.1                    pypi_0    pypi
tensorflow-privacy        0.7.3                    pypi_0    pypi
tensorflow-probability    0.16.0                   pypi_0    pypi
tf-estimator-nightly      2.8.0.dev2021122109          pypi_0    pypi
...

It seems that nightly builds are not needed, but you do need tensorflow-addons, which was not specified in requirements.txt. Could you try the above and see if it works?
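For reference, a pinned requirements file along the lines of the versions above might look roughly like this (a sketch based on the listing above, not the official requirements.txt):

tensorflow==2.8.0
tensorflow-addons==0.16.1
tensorflow-datasets==4.5.2
tensorflow-federated==0.20.0
tensorflow-model-optimization==0.7.1
tensorflow-privacy==0.7.3
tensorflow-probability==0.16.0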

fraboeni commented 2 years ago

Thank you so much for your help with that @kenziyuliu. The installation worked just fine.

Now, I am running into different errors: I ran `bazel run :fl_run` and got

INFO: Analyzed target //distributed_dp:fl_run (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //distributed_dp:fl_run up-to-date:
  bazel-bin/distributed_dp/fl_run
INFO: Elapsed time: 0.120s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action
E0310 17:25:35.563270 139699288257280 optimizer_utils.py:264] Unknown optimizer [None], known optimziers are [['sgd', 'adagrad', 'adam', 'yogi', 'lars', 'lamb', 'shampoo']]. To add support for an optimizer, add the optimzier class to the utils_impl._SUPPORTED_OPTIMIZERS list.
Traceback (most recent call last):
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_run.py", line 290, in <module>
    app.run(main)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_run.py", line 185, in main
    client_optimizer_fn = optimizer_utils.create_optimizer_fn_from_flags('client')
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/utils/optimizers/optimizer_utils.py", line 269, in create_optimizer_fn_from_flags
    raise ValueError('`{!s}` is not a valid optimizer for flag --{!s}, must be '
ValueError: `None` is not a valid optimizer for flag --client_optimizer, must be one of ['sgd', 'adagrad', 'adam', 'yogi', 'lars', 'lamb', 'shampoo']. See error log for details.

The issue did not occur when specifying the flags as in your example:

bazel run :fl_run -- \
    --task=emnist_character \
    --server_optimizer=sgd \
    --server_learning_rate=1 \
    --server_sgd_momentum=0.9 \
    --client_optimizer=sgd \
    --client_learning_rate=0.03 \
    --client_batch_size=20 \
    --experiment_name=my_emnist_test \
    --epsilon=10 \
    --l2_norm_clip=0.03 \
    --dp_mechanism=ddgauss \
    --logtostderr

This started out very promisingly, but then I got a different error:

I0310 17:29:00.706568 139991547269888 fl_utils.py:71] Shared DP Parameters:
I0310 17:29:00.706730 139991547269888 fl_utils.py:72] {'clip': 0.03,
 'delta': 0.0002941176470588235,
 'dim': 1018174,
 'epsilon': 10.0,
 'mechanism': 'ddgauss',
 'num_clients': 3400,
 'num_clients_per_round': 100,
 'num_rounds': 1500,
 'sampling_rate': 0.029411764705882353}
I0310 17:30:57.426323 139991547269888 fl_utils.py:151] ddgauss parameters:
I0310 17:30:57.426513 139991547269888 fl_utils.py:152] {'beta': 0.6065306597126334,
 'bits': 16,
 'dim': 1018174,
 'gamma': 3.292593044721554e-06,
 'inflated_l2': 0.030049064475707276,
 'k_stddevs': 4,
 'local_stddev': 0.002681329925591648,
 'mechanism': 'ddgauss',
 'noise_mult_clip': 0.8937766418638827,
 'noise_mult_inflated': 0.8923172725591274,
 'padded_dim': 1048576.0,
 'scale': 303711.99429067835}
I0310 17:30:57.426573 139991547269888 ddpquery_utils.py:44] Conditional rounding set to True (beta = 0.606531)
I0310 17:30:57.510118 139991547269888 keras_utils.py:365] Adding default num_examples metric to model
I0310 17:30:57.510220 139991547269888 keras_utils.py:368] Adding default num_batches metric to model
I0310 17:30:58.755060 139991547269888 keras_utils.py:365] Adding default num_examples metric to model
I0310 17:30:58.755179 139991547269888 keras_utils.py:368] Adding default num_batches metric to model
I0310 17:31:00.380089 139991547269888 keras_utils.py:365] Adding default num_examples metric to model
I0310 17:31:00.380198 139991547269888 keras_utils.py:368] Adding default num_batches metric to model
I0310 17:31:02.371215 139991547269888 keras_utils.py:365] Adding default num_examples metric to model
I0310 17:31:02.371326 139991547269888 keras_utils.py:368] Adding default num_batches metric to model
I0310 17:31:02.647132 139991547269888 keras_utils.py:365] Adding default num_examples metric to model
I0310 17:31:02.647240 139991547269888 keras_utils.py:368] Adding default num_batches metric to model
I0310 17:31:02.859875 139991547269888 training_utils.py:68] Writing...
I0310 17:31:02.859981 139991547269888 training_utils.py:69]     program state to: /tmp/ddp_fl/checkpoints/my_emnist_test
I0310 17:31:02.860028 139991547269888 training_utils.py:70]     CSV metrics to: /tmp/ddp_fl/results/my_emnist_test/experiment.metrics.csv
I0310 17:31:02.860080 139991547269888 training_utils.py:71]     TensorBoard summaries to: /tmp/ddp_fl/logdir/my_emnist_test
I0310 17:31:02.860128 139991547269888 training_loop.py:189] Running training process
I0310 17:31:03.333363 139991547269888 training_loop.py:201] Initializing training process
I0310 17:31:03.397290 139991547269888 training_loop.py:115] Running evaluation at round 0
Traceback (most recent call last):
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_run.py", line 290, in <module>
    app.run(main)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_run.py", line 274, in main
    state = tff.simulation.run_training_process(
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/simulation/training_loop.py", line 206, in run_training_process
    evaluation_metrics = _run_evaluation(evaluation_fn,
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/simulation/training_loop.py", line 119, in _run_evaluation
    evaluation_metrics = evaluation_fn(state, evaluation_data)
  File "/home/fraboeni/.cache/bazel/_bazel_fraboeni/eb0df9f25fbadff22165e0e943d33a0f/execroot/org_federated_research/bazel-out/k8-opt/bin/distributed_dp/fl_run.runfiles/org_federated_research/distributed_dp/fl_run.py", line 270, in evaluation_fn
    return federated_eval(state.model, evaluation_data)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/computation/computation_impl.py", line 119, in __call__
    return context.invoke(self, arg)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/execution_contexts/sync_execution_context.py", line 65, in invoke
    return self._event_loop.run_until_complete(
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/retrying.py", line 91, in retry_coro_fn
    raise e
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/retrying.py", line 88, in retry_coro_fn
    return await fn(*args, **kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/execution_contexts/async_execution_context.py", line 300, in invoke
    return await tracing.wrap_coroutine_in_current_trace_context(
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 391, in _wrapped
    return await coro
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/execution_contexts/async_execution_context.py", line 141, in _invoke
    result = await executor.create_call(comp, arg)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 346, in create_call
    return await comp_repr.invoke(self, arg)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 166, in invoke
    return await executor._evaluate(comp_lambda.result, new_scope)  # pylint: disable=protected-access
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 516, in _evaluate
    return await self._evaluate_block(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 480, in _evaluate_block
    return await self._evaluate(comp.block.result, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 508, in _evaluate
    return await self._evaluate_reference(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 432, in _evaluate_reference
    return await scope.resolve_reference(comp.reference.name)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 115, in resolve_reference
    return await value
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 510, in _evaluate
    return await self._evaluate_call(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 448, in _evaluate_call
    func, arg = await asyncio.gather(func, get_arg())
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 444, in get_arg
    return await self._evaluate(comp.call.argument, scope=scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 514, in _evaluate
    return await self._evaluate_struct(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 468, in _evaluate_struct
    values = await asyncio.gather(*values)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 508, in _evaluate
    return await self._evaluate_reference(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 432, in _evaluate_reference
    return await scope.resolve_reference(comp.reference.name)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 115, in resolve_reference
    return await value
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 510, in _evaluate
    return await self._evaluate_call(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 448, in _evaluate_call
    func, arg = await asyncio.gather(func, get_arg())
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 444, in get_arg
    return await self._evaluate(comp.call.argument, scope=scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 514, in _evaluate
    return await self._evaluate_struct(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 468, in _evaluate_struct
    values = await asyncio.gather(*values)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 508, in _evaluate
    return await self._evaluate_reference(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 432, in _evaluate_reference
    return await scope.resolve_reference(comp.reference.name)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 115, in resolve_reference
    return await value
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 510, in _evaluate
    return await self._evaluate_call(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 448, in _evaluate_call
    func, arg = await asyncio.gather(func, get_arg())
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 444, in get_arg
    return await self._evaluate(comp.call.argument, scope=scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 508, in _evaluate
    return await self._evaluate_reference(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 432, in _evaluate_reference
    return await scope.resolve_reference(comp.reference.name)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 115, in resolve_reference
    return await value
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 510, in _evaluate
    return await self._evaluate_call(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 448, in _evaluate_call
    func, arg = await asyncio.gather(func, get_arg())
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 444, in get_arg
    return await self._evaluate(comp.call.argument, scope=scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 514, in _evaluate
    return await self._evaluate_struct(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 468, in _evaluate_struct
    values = await asyncio.gather(*values)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 508, in _evaluate
    return await self._evaluate_reference(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 432, in _evaluate_reference
    return await scope.resolve_reference(comp.reference.name)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 115, in resolve_reference
    return await value
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 510, in _evaluate
    return await self._evaluate_call(comp, scope)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 449, in _evaluate_call
    return await self.create_call(func, arg=arg)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/reference_resolving_executor.py", line 342, in create_call
    return ReferenceResolvingExecutorValue(await
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 125, in create_call
    return await self._delegate(self._target_executor.create_call(comp, arg))
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 110, in _delegate
    result_value = await _delegate_with_trace_ctx(coro, self._event_loop)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 391, in _wrapped
    return await coro
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/federating_executor.py", line 457, in create_call
    return await self._strategy.compute_federated_intrinsic(
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/federating_executor.py", line 143, in compute_federated_intrinsic
    return await fn(arg)  # pylint: disable=not-callable
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/federated_resolving_strategy.py", line 458, in compute_federated_map
    return await self._map(arg, all_equal=False)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/federated_resolving_strategy.py", line 339, in _map
    results = await asyncio.gather(*[
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/federated_resolving_strategy.py", line 336, in _map_child
    fn_at_child = await child.create_value(fn, fn_type)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 115, in create_value
    return await self._delegate(
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/thread_delegating_executor.py", line 110, in _delegate
    result_value = await _delegate_with_trace_ctx(coro, self._event_loop)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 391, in _wrapped
    return await coro
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 201, in async_trace
    result = await fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/eager_tf_executor.py", line 683, in create_value
    normalized_value = to_representation_for_type(value,
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 228, in sync_trace
    result = fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/eager_tf_executor.py", line 519, in to_representation_for_type
    return _to_computation_internal_rep(
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 228, in sync_trace
    result = fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/eager_tf_executor.py", line 405, in _to_computation_internal_rep
    embedded_fn = embed_tensorflow_computation(value, type_spec, device)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 228, in sync_trace
    result = fn(*fn_args, **fn_kwargs)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/eager_tf_executor.py", line 273, in embed_tensorflow_computation
    comp = _ensure_comp_runtime_compatible(comp)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/eager_tf_executor.py", line 246, in _ensure_comp_runtime_compatible
    _check_dataset_reduce_for_multi_gpu(graph_def)
  File "/home/fraboeni/.conda/envs/tff/lib/python3.9/site-packages/tensorflow_federated/python/core/impl/executors/eager_tf_executor.py", line 63, in _check_dataset_reduce_for_multi_gpu
    raise ValueError(
ValueError: Detected dataset reduce op in multi-GPU TFF simulation: `use_experimental_simulation_loop=True` for `tff.learning`; or use `for ... in iter(dataset)` for your own dataset iterations. See https://www.tensorflow.org/federated/tutorials/simulations_with_accelerators for examples.

I tried to fix that by disabling GPU execution, inserting the following lines here: https://github.com/google-research/federated/blob/ed50f1e19c24086b480b7c5b85c6376a1a9ef1c6/distributed_dp/fl_run.py#L33 (following this tutorial: https://www.tensorflow.org/federated/tutorials/simulations_with_accelerators)

cpu_device = tf.config.list_logical_devices('CPU')[0]
tff.backends.native.set_local_python_execution_context(
    server_tf_device=cpu_device, client_tf_devices=[cpu_device])

and simply re-ran the command.

However, the error stayed the same. Would I have to do some kind of rebuild, or can you recommend another way to get rid of the error coming from TFF?

Thank you very much!

zcharles8 commented 2 years ago

@fraboeni Can you see what happens if you try toggling this line: https://github.com/google-research/federated/blob/ed50f1e19c24086b480b7c5b85c6376a1a9ef1c6/distributed_dp/fl_run.py#L251

For context, the client training that is part of tff.learning.build_federated_averaging_process can go in one of two ways depending on whether you set use_experimental_simulation_loop to True or False. Generally, setting this to True is for multi-GPU simulations.
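As a rough illustration (not the exact code in fl_run.py; the argument names other than the flag are assumed from the TFF 0.20 API), the flag is toggled where the training process is built:

# Sketch only: toggle the simulation loop when building the training process.
# task.model_fn, the optimizer fns, and the aggregation factory are assumed
# to be defined as in fl_run.py.
training_process = tff.learning.build_federated_averaging_process(
    model_fn=task.model_fn,
    client_optimizer_fn=client_optimizer_fn,
    server_optimizer_fn=server_optimizer_fn,
    model_update_aggregation_factory=aggregation_factory,
    use_experimental_simulation_loop=True)  # True is intended for multi-GPU simulations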

zcharles8 commented 2 years ago

Also for context, @kenziyuliu: I believe the nightly TFF packages are currently broken, so using the latest released version is the recommended way to proceed (as in your comment above).

fraboeni commented 2 years ago

Thanks for your prompt answer @zcharles8!

Unfortunately, I still get the same error whether I set the indicated line to True or False.

zcharles8 commented 2 years ago

@fraboeni Is that true if you don't add the call to tff.backends.native.set_local_python_execution_context that you described above?

For context, I just ran the command you posted above (purely on CPU) and it worked fine using the default executor.

zcharles8 commented 2 years ago

Oh wait, I see the potential problem. @fraboeni Based on the error, it sounds like you are using a multi-GPU environment. If that is the case, then you would need to alter this line: https://github.com/google-research/federated/blob/master/distributed_dp/fl_run.py#L266

In particular, set use_experimental_simulation_loop=True, matching the argument in tff.learning.build_federated_averaging_process. Let me know if that helps at all, and thanks for digging into this.
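Concretely, the evaluation builder would then take the same flag, roughly (a sketch; the surrounding code in fl_run.py is assumed):

# Sketch: keep the evaluation computation consistent with training.
federated_eval = tff.learning.build_federated_evaluation(
    task.model_fn, use_experimental_simulation_loop=True)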

fraboeni commented 2 years ago

Thank you very much @zcharles8.

Unfortunately, passing the parameter in the line you indicated does not solve the issue either: federated_eval = tff.learning.build_federated_evaluation(task.model_fn, use_experimental_simulation_loop=True)

I also tried switching off the GPUs with

cpu_device = tf.config.list_logical_devices('CPU')[0]
tff.backends.native.set_local_python_execution_context(
    server_tf_device=cpu_device, client_tf_devices=[cpu_device])

I also tried using only one GPU via that command. Unfortunately, nothing changes the error.

fraboeni commented 2 years ago

Hi @zcharles8, is there any news on your side about how we could get this code to run?

kenziyuliu commented 2 years ago

Hi @fraboeni, I tried following https://github.com/google-research/federated/issues/57#issuecomment-1062468566 on a single-GPU machine, and by default things seem to work fine.

Specifically, I followed https://github.com/google-research/federated/issues/57#issuecomment-1062468566, fixed the error in https://github.com/google-research/federated/issues/58, and checked that TF sees the GPU as

>>> tf.config.list_physical_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Running the example script from here seems to work (bazel run :fl_run -- ...). If it's a multi-GPU issue, maybe try forcing a single GPU as a workaround via export CUDA_VISIBLE_DEVICES=0. Hope this helps!
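If setting the environment variable is inconvenient, an untested in-Python alternative is to hide the extra GPUs before any TFF computations are built, for example:

import tensorflow as tf

# Sketch: keep only the first GPU visible (pass [] instead to force CPU-only execution).
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[:1], 'GPU')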

DeepaliKushwaha commented 1 year ago

Can anyone help me solve the same issue while using tff.templates.IterativeProcess instead of tff.learning.build_federated_averaging_process?
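(For reference, the ValueError above points at dataset reduce ops inside client-side tf.functions. For a custom iterative process, the pattern the error message suggests is to iterate with for ... in iter(dataset); a hypothetical sketch of a client update written that way, with illustrative names not taken from the repo:)

import tensorflow as tf

# Hypothetical sketch only: a client update that avoids dataset.reduce(...)
# by iterating the dataset directly, as the error message recommends.
@tf.function
def client_update(model, dataset, initial_weights, optimizer):
  tf.nest.map_structure(lambda v, w: v.assign(w),
                        model.trainable_variables, initial_weights)
  for batch in iter(dataset):  # instead of dataset.reduce(...)
    with tf.GradientTape() as tape:
      outputs = model.forward_pass(batch)
    grads = tape.gradient(outputs.loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return model.trainable_variables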

kairouzp commented 1 year ago

Could you please expand a bit more on what exactly you are doing? Are you creating a custom iterative process, or using one that we provide in the repo? Could you also please provide a snippet of the error you are seeing?