mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

`IndexError` in `cross_device_ops` with `MultiWorkerMirroredStrategy` #740

Open rapsealk opened 1 month ago

rapsealk commented 1 month ago

Hello mlcommons teams!

When using MultiWorkerMirroredStrategy, cross_device_ops raises an IndexError during optimizer.apply_gradients().

https://github.com/mlcommons/training/blob/87405ce77af1512bdf6b14288f52cd3fafa3cb71/image_classification/tensorflow2/resnet_runnable.py#L323-L324

https://github.com/tensorflow/tensorflow/blob/64918868e2154b06c7479347a59a4230f785e9fa/tensorflow/python/distribute/cross_device_ops.py#L1140-L1142

Summary: When using MultiWorkerMirroredStrategy in a distributed training setup, optimizer.apply_gradients() fails with an IndexError raised inside the cross_device_ops component.

for per_replica in reversed(per_replica_values):
  for i in range(len(self._devices)):
    values_by_device[i].append(per_replica.values[i])
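
The loop quoted above assumes that every PerReplica value carries one tensor per local device; the traceback suggests a per-replica value with fewer entries than len(self._devices), so the inner loop walks off the end of the tuple. A minimal sketch of that failure mode (plain Python with a hypothetical device list and values, not the TF implementation):

# One value present on only one replica, but two local devices expected.
devices = ["/gpu:0", "/gpu:1"]      # stands in for self._devices
per_replica_values = [(1.0,)]       # stands in for PerReplica.values

values_by_device = [[] for _ in devices]
for per_replica in reversed(per_replica_values):
    for i in range(len(devices)):
        values_by_device[i].append(per_replica[i])  # IndexError: tuple index out of range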

Details:

num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
  --distribution_strategy=multi_worker_mirrored \
  --all_reduce_alg=nccl \
  --batch_size=$(( 128 * $num_gpus * $num_workers )) \
  --enable_eager \
  --num_gpus=$num_gpus \
  --lr_schedule=polynomial \
  --optimizer=LARS
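
For context, the flags above roughly correspond to the following strategy setup (a minimal sketch assuming a TF 2.4-era API; the actual wiring lives in resnet_ctl_imagenet_main.py and its flag helpers, and with the placeholder hosts the constructor will block until both workers join):

import json
import os

import tensorflow as tf

# Hypothetical stand-in for the TF_CONFIG exported above; index is 0 on host0, 1 on host1.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host0:12345", "host1:12345"]},
    "task": {"type": "worker", "index": 0},
})

# --distribution_strategy=multi_worker_mirrored with --all_reduce_alg=nccl
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)

# apply_gradients() then runs inside a per-replica call (call_for_each_replica in the
# traceback below), which triggers the batch all-reduce in cross_device_ops.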

Impact: This issue prevents distributed training with MultiWorkerMirroredStrategy from running at all, blocking multi-node scaling of the benchmark.

Traceback (most recent call last):
  File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
    stats = run(flags.FLAGS)
  File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
    resnet_controller.train(evaluate=not flags_obj.skip_eval)
  File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
    train_outputs = self.train_fn(steps_per_loop)
  File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
    self.train_loop_fn(self.train_iter, num_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:

    /home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn  *
        step_fn(iterator)
    /home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:350 _apply_grads_and_clear  *
        distribution.extended.call_for_each_replica(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica  **
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:629 _call_for_each_replica
        self._container_strategy(), fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:93 call_for_each_replica
        return _call_for_each_replica(strategy, fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:234 _call_for_each_replica
        coord.join(threads)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:389 join
        six.reraise(*self._exc_info_to_raise)
    /usr/local/lib/python3.6/dist-packages/six.py:703 reraise
        raise value
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
        yield
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:228 _call_for_each_replica
        **merge_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/utils.py:152 _all_reduce_sum_fn  **
        grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2374 batch_reduce_to
        return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:697 _batch_reduce_to
        options=self._communication_options.merge(options))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:426 batch_reduce
        options)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 batch_reduce_implementation
        for value, dest in value_destination_pairs
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 <listcomp>
        for value, dest in value_destination_pairs
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1050 reduce_implementation
        options)[0]
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1103 _batch_all_reduce
        options)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1142 _do_batch_all_reduce_dense
        values_by_device[i].append(per_replica.values[i])

    IndexError: tuple index out of range
