tensorflow / adanet

Fast and flexible AutoML with learning guarantees.
https://adanet.readthedocs.io
Apache License 2.0

Support for tf.distribute.Strategy #87

Open cweill opened 5 years ago

cweill commented 5 years ago

AdaNet doesn't currently support tf.distribute.Strategy. The current way to configure distributed training is to use a tf.estimator.RunConfig with the TF_CONFIG environment variable set appropriately to identify the different workers.

Refs #76
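
For reference, a minimal sketch of the current setup (the cluster addresses, ports, model_dir, and placeholders are illustrative, not values from this repo):

    import json
    import os

    import adanet
    import tensorflow as tf

    # Each process in the cluster sets TF_CONFIG to identify its role.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "chief": ["localhost:2222"],
            "worker": ["localhost:2223", "localhost:2224"],
            "ps": ["localhost:2225"],
        },
        "task": {"type": "worker", "index": 0},
    })

    # RunConfig reads TF_CONFIG from the environment to build the cluster spec.
    config = tf.estimator.RunConfig(model_dir="/tmp/model")
    estimator = adanet.Estimator(
        head=...,                  # placeholder: a tf.estimator head
        subnetwork_generator=...,  # placeholder: an adanet.subnetwork.Generator
        max_iteration_steps=100,
        config=config)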

chamorajg commented 5 years ago

I am new to this codebase. I want to contribute to a few open-source AI and ML projects to gain experience. Can I work on this issue? Can you suggest what needs to be done? I went through the code and found certain TODOs in the comments of placement.py; given permission and some guidance, could I work on those?

chamorajg commented 5 years ago

@cweill Can you give me a rough idea of how to make AdaNet support tf.distribute.Strategy? I have good experience with TensorFlow, but the codebase is quite large to search through, so some pointers would help me get a quick start.

cweill commented 5 years ago

@chandramoulirajagopalan: The best way to get started will be to first extend estimator_distributed_test_runner.py to test your implementation. You can then pass the tf.distribute.Strategy you want to test to the tf.estimator.RunConfig when constructing the AdaNet Estimator. If it works, great! If it doesn't, feel free to post your update here and we'll work through it together.
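
Something like this, as a rough sketch (the strategy choice and the placeholders are illustrative; the real wiring lives in estimator_distributed_test_runner.py):

    import adanet
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    # train_distribute is the RunConfig hook for handing a
    # tf.distribute.Strategy to a tf.estimator-based Estimator.
    config = tf.estimator.RunConfig(
        model_dir="/tmp/model",
        train_distribute=strategy)

    estimator = adanet.Estimator(
        head=...,                  # placeholder: whatever head the runner builds
        subnetwork_generator=...,  # placeholder
        max_iteration_steps=100,
        config=config)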

chamorajg commented 5 years ago

@cweill Yes, I will work on that file first to test my implementation on the estimator, similar to issue #54, where tf.distribute.MirroredStrategy was used.

cweill commented 5 years ago

@chandramoulirajagopalan: Just a heads up: I believe tf.distribute.MirroredStrategy is designed for multi-GPU training, so it may be difficult to test. But if you get it to run inside estimator_distributed_test_runner.py, then good work. Let us know if you have any questions.
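
If you are testing on CPU only, one thing you could try (untested on my end) is pinning the strategy to an explicit device list so it doesn't go looking for GPUs:

    import tensorflow as tf

    # With a single CPU this degenerates to one replica, which is still
    # enough for a smoke test of the plumbing.
    strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])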

chamorajg commented 5 years ago

FAIL: test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps (adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest)
test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps(estimator='estimator_with_distributed_mirrored_strategy', placement_strategy='replication', num_workers=5, num_ps=3)
----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/testing/parameterized.py", line 262, in bound_param_test
       test_method(self, **testcase_params)
     File "adanet/core/estimator_distributed_test.py", line 325, in test_distributed_training
       timeout_secs=500)
     File "adanet/core/estimator_distributed_test.py", line 169, in _wait_for_processes
       self.assertEqual(0, ret_code)
   AssertionError: 0 != 1
   -------------------- >> begin captured logging << --------------------
   absl: INFO: Spawning chief_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning worker_3 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning ps_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning ps_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning ps_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Spawning evaluator_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
   absl: INFO: worker_0 finished
   absl: INFO: stderr for worker_0 (last 15000 chars): WARNING: Logging before flag parsing goes to stderr.
   W0608 00:13:42.569955 140367193364288 report_accessor.py:36] Failed to import report_pb2. ReportMaterializer will not work.
   I0608 00:13:42.935579 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
   2019-06-08 00:13:42.936921: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
   2019-06-08 00:13:42.983584: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1999890000 Hz
   2019-06-08 00:13:42.983983: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x20dff50 executing computations on platform Host. Devices:
   2019-06-08 00:13:42.984038: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
   I0608 00:13:43.000412 140367193364288 cross_device_ops.py:975] Device is available but not used by distribute strategy: /device:XLA_CPU:0
   W0608 00:13:43.001587 140367193364288 cross_device_ops.py:983] Not all devices in `tf.distribute.Strategy` are visible to TensorFlow.
   I0608 00:13:43.001820 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
   I0608 00:13:43.002067 140367193364288 run_config.py:532] Initializing RunConfig with distribution strategies.
   I0608 00:13:43.002384 140367193364288 estimator_training.py:176] RunConfig initialized for Distribute Coordinator with INDEPENDENT_WORKER mode
   W0608 00:13:43.003975 140367193364288 estimator.py:1760] Using temporary folder as model directory: /tmp/tmpv70w4krt
   I0608 00:13:43.004830 140367193364288 estimator.py:201] Using config: {'_model_dir': '/tmp/tmpv70w4krt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
   graph_options {
     rewrite_options {
       meta_optimizer_iterations: ONE
     }
   }
   , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fa98a429748>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa98a4299b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'independent_worker'}
   Traceback (most recent call last):
     File "adanet/core/estimator_distributed_test_runner.py", line 350, in <module>
       app.run(main)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
       _run_main(main, args)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
       sys.exit(main(argv))
     File "adanet/core/estimator_distributed_test_runner.py", line 346, in main
       train_and_evaluate_estimator()
     File "adanet/core/estimator_distributed_test_runner.py", line 318, in train_and_evaluate_estimator
       classifier.train(input_fn=_input_fn)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
       loss = self._train_model(input_fn, hooks, saving_listeners)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
       return self._train_model_distributed(input_fn, hooks, saving_listeners)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1180, in _train_model_distributed
       hooks)
     File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 302, in estimator_train
       if 'evaluator' in cluster_spec:
   TypeError: argument of type 'ClusterSpec' is not iterable

   --------------------- >> end captured logging << ---------------------
cweill commented 5 years ago

Good work getting that running inside the runner. I'm surprised that the error comes from deep inside TensorFlow Estimator. If you create a PR, I can take a look there.
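
For reference, the failing check reproduces in isolation: tf.train.ClusterSpec doesn't implement membership tests, so the 'evaluator' in cluster_spec line in estimator_training.py raises, while checking the job names works. A minimal sketch (cluster addresses are placeholders):

    import tensorflow as tf

    cluster_spec = tf.train.ClusterSpec({
        "chief": ["localhost:2222"],
        "worker": ["localhost:2223"],
    })

    # ClusterSpec defines no __contains__/__iter__, so `in` raises,
    # mirroring the failure above.
    try:
        "evaluator" in cluster_spec
    except TypeError as e:
        print(e)  # argument of type 'ClusterSpec' is not iterable

    # Converting to a dict (or checking .jobs) behaves as intended.
    print("evaluator" in cluster_spec.as_dict())  # False
    print("evaluator" in cluster_spec.jobs)       # False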