cweill opened this issue 5 years ago
I am new to this source code. I want to contribute to a few open-source AI and ML projects to gain experience. Can I work on this issue? Can you suggest what needs to be done? I went through the code and there are certain TODOs written as comments in placement.py; with permission and some guidance, I could work on those.
@cweill Can you give me a rough idea of how to make AdaNet support tf.distribute.Strategy? I have good experience with TensorFlow, but the codebase is quite large to search through, so a few pointers would help me get started quickly.
@chandramoulirajagopalan: The best way to get started is to first extend estimator_distributed_test_runner.py to test your implementation. You can then pass the tf.distribute.Strategy you want to test to the tf.estimator.RunConfig when constructing the AdaNet Estimator. If it works, then great! If it doesn't, feel free to post your update here, and we'll work through it together.
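For concreteness, here is a rough sketch of that wiring, not the actual runner code. The `subnetwork_generator` is assumed to come from the existing test-runner setup, and the head is a placeholder:

```python
import adanet
import tensorflow as tf


def make_estimator(subnetwork_generator):
  """Builds an AdaNet Estimator with a distribution strategy under test.

  `subnetwork_generator` is any adanet.subnetwork.Generator; it is assumed
  to exist elsewhere and is not defined here.
  """
  # The strategy under test rides in on RunConfig's `train_distribute`.
  strategy = tf.distribute.MirroredStrategy()
  config = tf.estimator.RunConfig(
      train_distribute=strategy,
      model_dir="/tmp/adanet_distributed")
  return adanet.Estimator(
      head=tf.contrib.estimator.regression_head(),  # placeholder head
      subnetwork_generator=subnetwork_generator,
      max_iteration_steps=100,
      config=config)
```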
@cweill Yes, I will work on that file first to test my implementation on the estimator, similar to issue #54, where tf.distribute.MirroredStrategy was used.
@chandramoulirajagopalan: Just a heads up: tf.distribute.MirroredStrategy is, I believe, designed for multi-GPU, so it may be difficult to test. But if you get it to run inside estimator_distributed_test_runner.py, then good work. Let us know if you have any questions.
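One possible workaround for testing without GPUs: MirroredStrategy accepts an explicit device list, so you can pin it to CPU just to exercise the code path (a smoke test, not a realistic setup):

```python
import tensorflow as tf

# By default MirroredStrategy mirrors across all visible GPUs; on a
# CPU-only machine, name the devices explicitly to exercise the code path.
strategy = tf.distribute.MirroredStrategy(devices=["/cpu:0"])
```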
FAIL: test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps (adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest)
test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps(estimator='estimator_with_distributed_mirrored_strategy', placement_strategy='replication', num_workers=5, num_ps=3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/testing/parameterized.py", line 262, in bound_param_test
    test_method(self, **testcase_params)
  File "adanet/core/estimator_distributed_test.py", line 325, in test_distributed_training
    timeout_secs=500)
  File "adanet/core/estimator_distributed_test.py", line 169, in _wait_for_processes
    self.assertEqual(0, ret_code)
AssertionError: 0 != 1
-------------------- >> begin captured logging << --------------------
absl: INFO: Spawning chief_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning worker_3 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning ps_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning ps_1 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning ps_2 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Spawning evaluator_0 process: python adanet/core/estimator_distributed_test_runner.py --estimator_type=estimator_with_distributed_mirrored_strategy --placement_strategy=replication --stderrthreshold=info --model_dir=/tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: Logging to /tmp/absl_testing/adanet.core.estimator_distributed_test.EstimatorDistributedTrainingTest.test_distributed_training_estimator_with_distributed_mirrored_strategy_replication_five_workers_three_ps
absl: INFO: worker_0 finished
absl: INFO: stderr for worker_0 (last 15000 chars): WARNING: Logging before flag parsing goes to stderr.
W0608 00:13:42.569955 140367193364288 report_accessor.py:36] Failed to import report_pb2. ReportMaterializer will not work.
I0608 00:13:42.935579 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
2019-06-08 00:13:42.936921: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-08 00:13:42.983584: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1999890000 Hz
2019-06-08 00:13:42.983983: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x20dff50 executing computations on platform Host. Devices:
2019-06-08 00:13:42.984038: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
I0608 00:13:43.000412 140367193364288 cross_device_ops.py:975] Device is available but not used by distribute strategy: /device:XLA_CPU:0
W0608 00:13:43.001587 140367193364288 cross_device_ops.py:983] Not all devices in `tf.distribute.Strategy` are visible to TensorFlow.
I0608 00:13:43.001820 140367193364288 run_config.py:503] TF_CONFIG environment variable: {'cluster': {'chief': ['localhost:38127'], 'worker': ['localhost:44993', 'localhost:55967', 'localhost:53003', 'localhost:59883'], 'ps': ['localhost:37729', 'localhost:55971', 'localhost:38587']}, 'task': {'type': 'worker', 'index': 0}}
I0608 00:13:43.002067 140367193364288 run_config.py:532] Initializing RunConfig with distribution strategies.
I0608 00:13:43.002384 140367193364288 estimator_training.py:176] RunConfig initialized for Distribute Coordinator with INDEPENDENT_WORKER mode
W0608 00:13:43.003975 140367193364288 estimator.py:1760] Using temporary folder as model directory: /tmp/tmpv70w4krt
I0608 00:13:43.004830 140367193364288 estimator.py:201] Using config: {'_model_dir': '/tmp/tmpv70w4krt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fa98a429748>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa98a4299b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': 'independent_worker'}
Traceback (most recent call last):
File "adanet/core/estimator_distributed_test_runner.py", line 350, in <module>
app.run(main)
File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/chandramouli/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "adanet/core/estimator_distributed_test_runner.py", line 346, in main
train_and_evaluate_estimator()
File "adanet/core/estimator_distributed_test_runner.py", line 318, in train_and_evaluate_estimator
classifier.train(input_fn=_input_fn)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1180, in _train_model_distributed
hooks)
File "/home/chandramouli/.local/lib/python3.6/site-packages/tensorflow/python/distribute/estimator_training.py", line 302, in estimator_train
if 'evaluator' in cluster_spec:
TypeError: argument of type 'ClusterSpec' is not iterable
--------------------- >> end captured logging << ---------------------
Good work getting that inside the runner. I'm surprised that the error is coming from deep down in TensorFlow Estimator. If you create a PR, I can have a look there.
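For reference, the failing line does a membership test on a tf.train.ClusterSpec, which (in this TF version, per the traceback) supports neither `__contains__` nor `__iter__`. A minimal reproduction, assuming the cluster from the log above:

```python
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "chief": ["localhost:38127"],
    "worker": ["localhost:44993"],
    "ps": ["localhost:37729"],
})

# estimator_training.py effectively does `if 'evaluator' in cluster_spec:`,
# which raises:
#   TypeError: argument of type 'ClusterSpec' is not iterable

# The job names are reachable through .jobs or .as_dict() instead:
print("evaluator" in cluster_spec.jobs)       # False
print("evaluator" in cluster_spec.as_dict())  # False
```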
AdaNet doesn't currently support tf.distribute.Strategy. The current way to define distributed training is using a tf.estimator.RunConfig with the TF_CONFIG environment variable properly set to identify the different workers.

Refs #76
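For anyone landing here, the TF_CONFIG setup looks roughly like this. Hostnames and ports are placeholders; every process in the cluster gets the same `cluster` dict but a different `task` entry identifying its own role:

```python
import json
import os

import tensorflow as tf

# TF_CONFIG tells each process who is in the cluster and which role it plays.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
        "ps": ["host3:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# RunConfig reads TF_CONFIG from the environment automatically.
config = tf.estimator.RunConfig(model_dir="/tmp/adanet_model")
```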