mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

TypeError: Expected value to be mirrored across replicas: SyncOnReadVariable: #576

Closed ZhangYimeng98 closed 1 year ago

ZhangYimeng98 commented 2 years ago

Python version: 3.7.3, TensorFlow version: 2.4.0

TypeError: in user code:

/work/mlperf-meituan/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn  *
    step_fn(iterator)
/work/mlperf-meituan/training/image_classification/tensorflow2/resnet_runnable.py:330 _apply_grads_and_clear  *
    distribution.extended.call_for_each_replica(
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica  **
    return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py:629 _call_for_each_replica
    self._container_strategy(), fn, args, kwargs)
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py:93 call_for_each_replica
    return _call_for_each_replica(strategy, fn, args, kwargs)
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py:234 _call_for_each_replica
    coord.join(threads)
/usr/local/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py:389 join
    six.reraise(*self._exc_info_to_raise)
/usr/local/lib/python3.7/site-packages/six.py:703 reraise
    raise value
/usr/local/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
    yield
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_run.py:228 _call_for_each_replica
    **merge_kwargs)
/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:683 _distributed_apply  **
    var, apply_grad_to_update_var, args=(grad,), group=False))
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2494 update
    return self._update(var, fn, args, kwargs, group)
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/mirrored_strategy.py:710 _update
    fn(v, *distribute_utils.select_replica_mirrored(i, args),
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_utils.py:160 select_replica_mirrored
    return nest.map_structure(_get_mirrored, structured)
/usr/local/lib/python3.7/site-packages/tensorflow/python/util/nest.py:659 map_structure
    structure[0], [func(*x) for x in entries],
/usr/local/lib/python3.7/site-packages/tensorflow/python/util/nest.py:659 <listcomp>
    structure[0], [func(*x) for x in entries],
/usr/local/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_utils.py:152 _get_mirrored
    (x, structured))

TypeError: Expected value to be mirrored across replicas: SyncOnReadVariable:{
  0: <tf.Variable 'conv1/kernel:0_accum:0' shape=(7, 7, 3, 64) dtype=float32>
} in (SyncOnReadVariable:{
  0: <tf.Variable 'conv1/kernel:0_accum:0' shape=(7, 7, 3, 64) dtype=float32>
},).
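For context, the last frames of the traceback point at `select_replica_mirrored` in `distribute_utils.py`: every value handed to `strategy.extended.update` must be a `Mirrored` distributed value, but the `_accum` variables that the benchmark creates for gradient accumulation are `SyncOnReadVariable`s, so the type check raises. A minimal pure-Python sketch of that check (simplified stand-in classes, not the actual TensorFlow code):

```python
# Simplified sketch of the mirrored-value check in
# tensorflow/python/distribute/distribute_utils.py. The classes below are
# stand-ins for TensorFlow's distributed-value types, for illustration only.

class Mirrored:
    """Stands in for a per-replica value kept in sync across replicas."""
    def __init__(self, values):
        self.values = values


class SyncOnReadVariable:
    """Stands in for a variable that is only aggregated when read."""
    def __init__(self, values):
        self.values = values


def select_replica_mirrored(replica_id, structured):
    """Pick replica_id's component from each value, accepting only Mirrored."""
    def _get_mirrored(x):
        if not isinstance(x, Mirrored):
            raise TypeError(
                "Expected value to be mirrored across replicas: "
                "%r in %r." % (x, structured))
        return x.values[replica_id]
    return tuple(_get_mirrored(x) for x in structured)
```

A mirrored gradient passes through, while a sync-on-read accumulator (like the `conv1/kernel:0_accum:0` variable in the traceback) trips the `TypeError` above.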
PaulDelestrac commented 1 year ago

Same issue here. Did anyone find a solution?

johntran-nv commented 1 year ago

@sgpyc , any idea what's going on here?

PaulDelestrac commented 1 year ago

I managed to solve this in my case: I was running on a single GPU but still had num_accumulation_step set to 2. Changing it to 1 made the training loop work.

johntran-nv commented 1 year ago

Glad to hear you figured this out. Closing.