Hello mlcommons team!

Summary:
The distributed training setup works correctly with OneDeviceStrategy and MirroredStrategy. However, when switching to MultiWorkerMirroredStrategy, local_replica_id fails to resolve to a valid value: it comes back as None, and the tensor ("while/cond/replica_id_in_sync_group:0") appears to be empty.
Details:
Environment:
TensorFlow Version: 2.4.0
Cluster Setup: Multi-node with 2 nodes (8 GPUs each)
Strategies Tested:
OneDeviceStrategy: Successful execution
MirroredStrategy: Successful execution
MultiWorkerMirroredStrategy: Fails with None for local_replica_id
Issue Description:
When using MultiWorkerMirroredStrategy, local_replica_id is not assigned correctly and ends up as None. Additionally, the tensor ("while/cond/replica_id_in_sync_group:0") is observed to be empty. This breaks synchronous training across the workers. The code that fails is here:
https://github.com/mlcommons/training/blob/87405ce77af1512bdf6b14288f52cd3fafa3cb71/image_classification/tensorflow2/resnet_runnable.py#L312-L314
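For context, here is a minimal standalone reduction of what I believe is happening (a hypothetical sketch, not the exact repo code; see the permalink above). Inside a tf.function, replica_id_in_sync_group is a symbolic Tensor, and tf.get_static_value can only recover a Python int when the tensor's value is known at trace time; under MultiWorkerMirroredStrategy it apparently is not, so None comes back:

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

@tf.function
def step_fn():
  ctx = tf.distribute.get_replica_context()
  replica_id = ctx.replica_id_in_sync_group  # scalar int32 Tensor
  # tf.get_static_value returns a Python int only if the tensor is a
  # trace-time constant; otherwise it returns None, matching this report.
  local_replica_id = tf.get_static_value(replica_id)
  tf.print("replica_id tensor:", replica_id)
  print("static local_replica_id:", local_replica_id)

strategy.run(step_fn)

Under OneDeviceStrategy and MirroredStrategy the same lookup yields a usable value, which would be consistent with the successful runs listed above.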
Steps to Reproduce:
1. Configure the cluster environment with appropriate TF_CONFIG settings for multi-node operation.
2. Initialize MultiWorkerMirroredStrategy.
3. Execute the training script designed for distributed training, using the launch command below.
4. Observe the failure to assign a valid local_replica_id and the resulting empty tensor value.
num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
  --distribution_strategy=multi_worker_mirrored \
  --all_reduce_alg=nccl \
  --batch_size=$(( 128 * $num_gpus * $num_workers )) \
  --enable_eager \
  --num_gpus=$num_gpus \
  --lr_schedule=polynomial \
  --optimizer=LARS
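As a quick sanity check (a hypothetical snippet, not part of the original launch script), the cluster resolver can be inspected on each worker before training to confirm that TF_CONFIG is parsed as expected:

import json
import os

import tensorflow as tf

# Confirm this worker sees the intended cluster and task from TF_CONFIG.
tf_config = json.loads(os.environ["TF_CONFIG"])
print("task:", tf_config["task"])

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)
# With 8 GPUs per worker and 2 workers, this should report 16.
print("num_replicas_in_sync:", strategy.num_replicas_in_sync)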
Expected Behavior:
local_replica_id should resolve to a valid index for each replica on every worker in the cluster, enabling proper synchronization and distributed training.
Observed Behavior:
local_replica_id is None, and the tensor "while/cond/replica_id_in_sync_group:0" appears empty; indexing the accumulated-gradient tuple with None then raises "TypeError: tuple indices must be integers or slices, not NoneType". Full log output from the failing run:
INFO:tensorflow:Error reported to Coordinator: tuple indices must be integers or slices, not NoneType
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 323, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 146, in _apply_grads_and_clear_for_each_replica
ag__.for_stmt(ag__.converted_call(ag__.ld(zip), (ag__.ld(self).accum_grads, ag__.ld(self).training_vars), None, fscope_3), None, loop_body, get_state_5, set_state_5, (), {'iterate_names': '(accum_grad, var)'})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 444, in for_stmt
_py_for_stmt(iter_, extra_test, body, None, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 473, in _py_for_stmt
body(target)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 459, in protected_body
original_body(protected_iter)
File "/tmp/tmpsgvrk_5a.py", line 139, in loop_body
replica_accum_grad = ag__.ld(local_accum_grad)[ag__.ld(local_replica_id)]
TypeError: tuple indices must be integers or slices, not NoneType
INFO:tensorflow:Error reported to Coordinator: tuple indices must be integers or slices, not NoneType
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 228, in _call_for_each_replica
**merge_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 186, in _maybe_apply_grads_and_clear
ag__.converted_call(ag__.ld(tf).cond, (ag__.converted_call(ag__.ld(tf).equal, ((ag__.ld(self).optimizer.iterations % ag__.ld(self).num_accumulation_steps), (ag__.ld(self).num_accumulation_steps - 1)), None, fscope_2), ag__.ld(_apply_grads_and_clear), ag__.ld(_advance_iteration)), None, fscope_2)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 396, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 479, in _call_unconverted
return f(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1396, in cond_for_tf_v2
return cond(pred, true_fn=true_fn, false_fn=false_fn, strict=True, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 538, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1180, in cond
return cond_v2.cond_v2(pred, true_fn, false_fn, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/cond_v2.py", line 89, in cond_v2
op_return_value=pred)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 165, in _apply_grads_and_clear
ag__.converted_call(ag__.ld(distribution).extended.call_for_each_replica, (ag__.ld(_apply_grads_and_clear_for_each_replica),), dict(args=()), fscope_4)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 396, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 478, in _call_unconverted
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2730, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 629, in _call_for_each_replica
self._container_strategy(), fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 93, in call_for_each_replica
return _call_for_each_replica(strategy, fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 234, in _call_for_each_replica
coord.join(threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 323, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/tmp/tmpsgvrk_5a.py", line 146, in _apply_grads_and_clear_for_each_replica
ag__.for_stmt(ag__.converted_call(ag__.ld(zip), (ag__.ld(self).accum_grads, ag__.ld(self).training_vars), None, fscope_3), None, loop_body, get_state_5, set_state_5, (), {'iterate_names': '(accum_grad, var)'})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 444, in for_stmt
_py_for_stmt(iter_, extra_test, body, None, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 473, in _py_for_stmt
body(target)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/control_flow.py", line 459, in protected_body
original_body(protected_iter)
File "/tmp/tmpsgvrk_5a.py", line 139, in loop_body
replica_accum_grad = ag__.ld(local_accum_grad)[ag__.ld(local_replica_id)]
TypeError: tuple indices must be integers or slices, not NoneType
Diagnostic output printed from inside the step function:
tf.distribute.get_replica_context().replica_id_in_sync_group: Tensor("while/cond/replica_id_in_sync_group:0", shape=(), dtype=int32, device=/job:worker/replica:0/task:0/device:GPU:0)
local_replica_id: None
Traceback (most recent call last):
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
stats = run(flags.FLAGS)
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
TypeError: in user code:
/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn *
step_fn(iterator)
/home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:328 _apply_grads_and_clear_for_each_replica *
replica_accum_grad = local_accum_grad[local_replica_id]
TypeError: tuple indices must be integers or slices, not NoneType
Impact:
This issue prevents successful distributed training with MultiWorkerMirroredStrategy, limiting the ability to scale training across multiple nodes.
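A possible direction for a fix, sketched under the assumption (taken from the traceback's local_accum_grad[local_replica_id]) that the per-replica gradient slots live in a Python tuple: select the slot with tf.switch_case, which dispatches on a scalar int32 Tensor, instead of Python indexing that requires a static int. The mapping from the global in-sync-group id to a local index below is also an assumption:

import tensorflow as tf

def select_replica_slot(local_accum_grad):
  # Hypothetical workaround sketch, to run inside the replica function.
  replica_id = tf.distribute.get_replica_context().replica_id_in_sync_group
  num_local = len(local_accum_grad)
  # replica_id_in_sync_group is global across workers; assume local index =
  # global id modulo the number of local replicas (an unverified assumption).
  local_index = replica_id % num_local
  # tf.switch_case accepts a Tensor-valued branch index, so no static
  # Python int is required.
  return tf.switch_case(local_index,
                        branch_fns=[lambda g=g: g for g in local_accum_grad])

I have not verified this against the repo's accumulation logic; it is only meant to show that the per-replica selection can be done with a Tensor-valued index.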