Summary:
When using MultiWorkerMirroredStrategy in a distributed training setup, an IndexError is raised during optimizer.apply_gradients(), from within the cross_device_ops component. The error originates in this loop in tensorflow/python/distribute/cross_device_ops.py (_do_batch_all_reduce_dense):

for per_replica in reversed(per_replica_values):
  for i in range(len(self._devices)):
    values_by_device[i].append(per_replica.values[i])
Details:
Environment:
TensorFlow Version: 2.4.0
Cluster Setup: 2 worker nodes, 8 GPUs each
Strategy: MultiWorkerMirroredStrategy
Issue Description:
During training, optimizer.apply_gradients() raises an IndexError from the cross_device_ops component, which aborts the training run and prevents the job from completing across the nodes.
Reproduction Steps:
1. Configure the cluster environment with appropriate TF_CONFIG settings for multi-node operation.
2. Initialize MultiWorkerMirroredStrategy in the training script (a minimal setup sketch follows these steps).
3. Execute the training script, which defines a model and runs the training loop on the distributed dataset (the MLPerf ResNet script below uses a custom training loop rather than model.fit()).
4. Observe the IndexError raised during the optimizer.apply_gradients() call.
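For reference, a minimal sketch of steps 1 to 3 is shown below. This is not the MLPerf script; the model, optimizer, and loss are placeholders used only to illustrate where TF_CONFIG, the strategy scope, and apply_gradients() fit together.

import json
import os

import tensorflow as tf

# Step 1: describe the cluster; "index" selects this worker (0 or 1).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host0:12345", "host1:12345"]},
    "task": {"type": "worker", "index": 0},
})

# Step 2: create the strategy before building any variables.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model and optimizer; the real run uses ResNet-50 with LARS.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    optimizer = tf.keras.optimizers.SGD(0.01)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            loss = tf.reduce_mean(
                tf.keras.losses.sparse_categorical_crossentropy(
                    labels, logits, from_logits=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Steps 3/4: the reported IndexError surfaces from this call,
        # inside the cross-device all-reduce of the gradients.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    return strategy.run(step_fn, args=(dist_inputs,))

The actual command used to reproduce the error with the MLPerf ResNet script: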
num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
--distribution_strategy=multi_worker_mirrored \
--all_reduce_alg=nccl \
--batch_size=$(( 128 * $num_gpus * $num_workers )) \
--enable_eager \
--num_gpus=$num_gpus \
--lr_schedule=polynomial \
--optimizer=LARS
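With these settings, --batch_size evaluates to 128 * 8 * 2 = 2048, i.e. 128 per GPU across the 16 GPUs in the cluster.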
Expected Behavior:
optimizer.apply_gradients() should execute without errors, allowing training to proceed correctly across all nodes in the cluster.
Observed Behavior:
An IndexError is raised during the optimizer.apply_gradients() call, originating from cross_device_ops, which aborts the training process.
Impact:
This issue prevents the successful execution of distributed training with MultiWorkerMirroredStrategy, hindering the scalability and efficiency of the training process across multiple nodes.
Traceback (most recent call last):
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
stats = run(flags.FLAGS)
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:
/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn *
step_fn(iterator)
/home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:350 _apply_grads_and_clear *
distribution.extended.call_for_each_replica(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica **
return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:629 _call_for_each_replica
self._container_strategy(), fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:93 call_for_each_replica
return _call_for_each_replica(strategy, fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:234 _call_for_each_replica
coord.join(threads)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:389 join
six.reraise(*self._exc_info_to_raise)
/usr/local/lib/python3.6/dist-packages/six.py:703 reraise
raise value
/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
yield
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:228 _call_for_each_replica
**merge_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/utils.py:152 _all_reduce_sum_fn **
grads_and_vars)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2374 batch_reduce_to
return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:697 _batch_reduce_to
options=self._communication_options.merge(options))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:426 batch_reduce
options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 batch_reduce_implementation
for value, dest in value_destination_pairs
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 <listcomp>
for value, dest in value_destination_pairs
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1050 reduce_implementation
options)[0]
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1103 _batch_all_reduce
options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1142 _do_batch_all_reduce_dense
values_by_device[i].append(per_replica.values[i])
IndexError: tuple index out of range
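The loop at cross_device_ops.py:1140-1142 indexes each incoming per-replica value once per entry in self._devices, so the error indicates that at least one of the values passed to the batch all-reduce had fewer components than the device list. A minimal, self-contained sketch (hypothetical values, not TensorFlow internals) reproduces the same IndexError from that indexing pattern:

# Two local devices are expected, but the incoming value has only one component.
devices = ["/gpu:0", "/gpu:1"]
per_replica_values = [(1.0,)]  # stands in for a PerReplica whose .values tuple is too short

values_by_device = [[] for _ in devices]
for per_replica in reversed(per_replica_values):
    for i in range(len(devices)):
        values_by_device[i].append(per_replica[i])  # raises IndexError: tuple index out of range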
Hello mlcommons team!

When using MultiWorkerMirroredStrategy, cross_device_ops raises an IndexError during optimizer.apply_gradients().

Relevant code in the MLPerf ResNet training loop:
https://github.com/mlcommons/training/blob/87405ce77af1512bdf6b14288f52cd3fafa3cb71/image_classification/tensorflow2/resnet_runnable.py#L323-L324

The loop that raises the error in TensorFlow:
https://github.com/tensorflow/tensorflow/blob/64918868e2154b06c7479347a59a4230f785e9fa/tensorflow/python/distribute/cross_device_ops.py#L1140-L1142
Refs: 739