horovod / horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
http://horovod.ai

tensorflow hvd.DistributedOptimizer bug #3994

Open Chenjingliang1 opened 11 months ago

Chenjingliang1 commented 11 months ago

Environment:

  1. Framework: TensorFlow
  2. Framework version: 2.12
  3. Horovod version: master

hvd.DistributedOptimizer raises an error when the groups parameter is set. The following call reproduces it: opt = hvd.DistributedOptimizer(opt, op=hvd.Sum, groups=4)

https://github.com/horovod/horovod/blob/master/horovod/tensorflow/__init__.py#L322

On that line, tensors should be replaced with indexed_slices_list.

tensors can contain both dense Tensor and IndexedSlices objects, and only IndexedSlices has a dense_shape attribute.

Current code:
new_indexed_slices = [tf.IndexedSlices(x, i, dense_shape=t.dense_shape) for x,i,t in zip(new_values, new_indices, tensors)]

Proposed fix:
new_indexed_slices = [tf.IndexedSlices(x, i, dense_shape=t.dense_shape) for x,i,t in zip(new_values, new_indices, indexed_slices_list)]

Error traceback:
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 764, in compute_gradients
[0]<stderr>:    avg_grads = _filtered_reduce_grads(grads, vars)
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 729, in _filtered_reduce_grads
[0]<stderr>:    rg = self._allreduce_grads(rg, rv)
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 601, in allreduce_grads
[0]<stderr>:    ignore_name_scope=use_generic_names)
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 411, in _grouped_allreduce_cond
[0]<stderr>:    allreduce_fn, id_fn)
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[0]<stderr>:    raise e.with_traceback(filtered_tb) from None
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 401, in allreduce_fn
[0]<stderr>:    return grouped_allreduce(tensors, *args, process_set=process_set, **kwargs)
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 324, in grouped_allreduce
[0]<stderr>:    dense_shape=t.dense_shape) for x,i,t in zip(new_values, new_indices, tensors)]
[0]<stderr>:  File "/opt/apps/local/lib64/python3/dist-packages/horovod/tensorflow/__init__.py", line 324, in <listcomp>
[0]<stderr>:    dense_shape=t.dense_shape) for x,i,t in zip(new_values, new_indices, tensors)]
[0]<stderr>:AttributeError: 'Tensor' object has no attribute 'dense_shape'
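
The failure mode in the traceback can be illustrated without TensorFlow or Horovod installed. The sketch below is not Horovod's code: DenseTensor and SparseSlices are hypothetical stand-ins for tf.Tensor and tf.IndexedSlices, and rebuild_sparse is an assumed simplification of the surrounding logic. Only the zip over tensors versus indexed_slices_list is taken from the issue.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class DenseTensor:
    """Hypothetical stand-in for a dense tf.Tensor (no dense_shape attribute)."""
    values: List[float]


@dataclass
class SparseSlices:
    """Hypothetical stand-in for tf.IndexedSlices."""
    values: List[float]
    indices: List[int]
    dense_shape: Tuple[int, ...]


Grad = Union[DenseTensor, SparseSlices]


def rebuild_sparse(tensors: List[Grad], use_filtered_list: bool) -> List[SparseSlices]:
    # Assumption: only the sparse entries contribute values and indices.
    indexed_slices_list = [t for t in tensors if isinstance(t, SparseSlices)]
    new_values = [t.values for t in indexed_slices_list]
    new_indices = [t.indices for t in indexed_slices_list]
    # The reported variant zips against the mixed list, so a dense entry
    # can be asked for dense_shape; the proposed variant zips against the
    # filtered list, where every entry has that attribute.
    shape_source = indexed_slices_list if use_filtered_list else tensors
    return [
        SparseSlices(x, i, dense_shape=t.dense_shape)
        for x, i, t in zip(new_values, new_indices, shape_source)
    ]


# A mixed gradient list: one dense entry followed by one sparse entry.
grads = [DenseTensor([0.1, 0.2]), SparseSlices([1.0], [3], (10, 1))]

try:
    rebuild_sparse(grads, use_filtered_list=False)
    buggy_error = None
except AttributeError as exc:
    buggy_error = str(exc)

fixed = rebuild_sparse(grads, use_filtered_list=True)
print(buggy_error)
print(fixed)
```

With a mixed list, the first variant fails as soon as a dense entry is paired with sparse values, while the second only ever reads dense_shape from sparse entries. Whether the actual fix in Horovod needs further changes is not something this sketch can confirm.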