google-research / federated

A collection of Google research projects related to Federated Learning and Federated Analytics.
Apache License 2.0

NotImplementedError("b/162106885") for Optimization Folder #24

Closed houcharlie closed 3 years ago

houcharlie commented 3 years ago

Environment: Tensorflow 2.3.0, Tensorflow Federated 0.17.0, Ubuntu 18.04, Bazel 3.1

When I run the given command

    bazel run main:federated_trainer -- --task=emnist_cr --total_rounds=100 \
      --client_optimizer=sgd --client_learning_rate=0.1 --client_batch_size=20 \
      --server_optimizer=sgd --server_learning_rate=1.0 --clients_per_round=10 \
      --client_epochs_per_round=1 --experiment_name=emnist_fedavg_experiment

I get the following error

    DEBUG: Rule 'rules_python' indicated that a canonical reproducible form can be obtained by modifying arguments commit = "a0fbf98d4e3a232144df4d0d80b577c7a693b570", shallow_since = "1586444447 +0200" and dropping ["tag"]
    DEBUG: Repository rules_python instantiated at:
      no stack (--record_rule_instantiation_callstack not enabled)
    Repository rule git_repository defined at:
      /jet/home/houc/.cache/bazel/_bazel_houc/c7f7578c4b4c04555c85530cc5b041a3/external/bazel_tools/tools/build_defs/repo/git.bzl:195:18: in <toplevel>
    INFO: Analyzed target //optimization/main:federated_trainer (0 packages loaded, 0 targets configured).
    INFO: Found 1 target...
    Target //optimization/main:federated_trainer up-to-date:
      bazel-bin/optimization/main/federated_trainer
    INFO: Elapsed time: 0.184s, Critical Path: 0.01s
    INFO: 0 processes.
    INFO: Build completed successfully, 1 total action
    INFO: Running command line: bazel-bin/optimization/main/federated_trainer '--task=emnist_cr' '--total_rounds=100' '--client_optimizer=sgd' '--client_learning_rate=0.1' '--client_batch_sizeINFO: Build completed successfully, 1 total action
    2021-03-02 16:33:04.617392: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    I0302 16:33:29.877769 140271334733632 client_data.py:154] Using newer tf.data.Dataset construction behavior.
    2021-03-02 16:33:29.883310: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
    2021-03-02 16:33:29.883337: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
    2021-03-02 16:33:29.883361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (br013.ib.bridges2.psc.edu): /proc/driver/nvidia/version does not exist
    2021-03-02 16:33:29.941923: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2245890000 Hz
    2021-03-02 16:33:29.962079: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x57f41e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2021-03-02 16:33:29.962143: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
    I0302 16:33:39.132875 140271334733632 client_data.py:154] Using newer tf.data.Dataset construction behavior.
    Traceback (most recent call last):
      File "/jet/home/houc/.cache/bazel/_bazel_houc/c7f7578c4b4c04555c85530cc5b041a3/execroot/org_federated_research/bazel-out/k8-opt/bin/optimization/main/federated_trainer.runfiles/org_federated_research/optimization/main/federated_trainer.py", line 261, in <module>
        app.run(main)
      File "/jet/home/houc/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
        _run_main(main, args)
      File "/jet/home/houc/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
        sys.exit(main(argv))
      File "/jet/home/houc/.cache/bazel/_bazel_houc/c7f7578c4b4c04555c85530cc5b041a3/execroot/org_federated_research/bazel-out/k8-opt/bin/optimization/main/federated_trainer.runfiles/org_federated_research/optimization/main/federated_trainer.py", line 219, in main
        task_spec, model=FLAGS.emnist_cr_model)
      File "/jet/home/houc/.cache/bazel/_bazel_houc/c7f7578c4b4c04555c85530cc5b041a3/execroot/org_federated_research/bazel-out/k8-opt/bin/optimization/main/federated_trainer.runfiles/org_federated_research/optimization/emnist/federated_emnist.py", line 82, in configure_training
        @tff.tf_computation(tf.string)
      File "/jet/home/houc/.local/lib/python3.6/site-packages/tensorflow_federated/python/core/impl/wrappers/computation_wrapper.py", line 407, in __call__
        result = fn_to_wrap(*args, **kwargs)
      File "/jet/home/houc/.cache/bazel/_bazel_houc/c7f7578c4b4c04555c85530cc5b041a3/execroot/org_federated_research/bazel-out/k8-opt/bin/optimization/main/federated_trainer.runfiles/org_federated_research/optimization/emnist/federated_emnist.py", line 84, in build_train_dataset_from_client_id
        client_dataset = emnist_train.dataset_computation(client_id)
      File "/jet/home/houc/.local/lib/python3.6/site-packages/tensorflow_federated/python/simulation/hdf5_client_data.py", line 86, in dataset_computation
        raise NotImplementedError("b/162106885")
    NotImplementedError: b/162106885

I notice that commits are still being actively made in this folder. Is it not currently stable? And if not, is there a commit id on which the code will run? Or is this folder no longer compatible with the environment I specified at the beginning of the post?

zcharles8 commented 3 years ago

Hi @houcharlie. We recommend using the version of TFF provided by tensorflow-federated-nightly (see https://pypi.org/project/tensorflow-federated-nightly/). In particular, the NotImplementedError appears to occur because your version of TFF does not include commit a22bdabb02fd2f6eb9dc4b8c4459c1bf204fa829.
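To confirm which TFF distribution is actually installed before re-running, something like the following may help. It is only a minimal sketch: `installed_versions` and `CANDIDATES` are illustrative names, not part of TFF, and it assumes Python 3.8+ for `importlib.metadata` (the traceback above shows Python 3.6, where the `importlib_metadata` backport would be needed instead).

```python
from importlib import metadata

# PyPI distribution names to look for. Both names are real PyPI projects,
# but this tuple itself is just an illustrative default.
CANDIDATES = ("tensorflow-federated", "tensorflow-federated-nightly")


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If both distributions show a version, the two installs may be shadowing each other, so it is probably worth keeping only the nightly one.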

As for the stability of the directory, the important thing here is that Federated Research is updated in part due to updates in TensorFlow Federated, which is still very much in active development. That being said, we try our best to make sure that all code in this directory is tested and functioning as intended (notably, the tests are run with tensorflow-federated-nightly). While I don't have a specific commit for you, commit 42ec49634d9d27d0ac5d16820271d6d2cc5b55b9 should work with current and future versions of tensorflow-federated-nightly.

Last, the ongoing updates to this directory are part of an effort to upstream it to TFF. We understand that this can cause unintended issues in the short term (please notify us if something seems broken!), but it should make the code much more useful in the long term.

zcharles8 commented 3 years ago

Let me know if you have any other questions on this! Happy to answer, help, and provide guidance. Thanks for your interest!

houcharlie commented 3 years ago

Yep, that worked. Thanks!