tensorflow / models

Models and examples built with TensorFlow

"`merge_call` called while defining a new graph or a tf.function." #9972

Open Zrufy opened 3 years ago

Zrufy commented 3 years ago

When I start training with the command python model_main_tf2.py --logtostderr --model_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_focal_loss_pets.config I get this error. Has anyone already encountered it? Are there any solutions?

mrinal18 commented 3 years ago

What is the error? Which system are you using? Please share your config file and the command you are running. Which TF version is being used?

Zrufy commented 3 years ago

1) I don't know what the error means; that is why I was asking.
2) Windows
3) config
4) I have already posted the command I use.
5) 2.4.1

mrinal18 commented 3 years ago

I am a little confused by the description. Can you clarify what you meant by

i get this error.
Has anyone already encountered this error? Are there any solutions?

The reason for my confusion is that I don't see any error message in the description.

Zrufy commented 3 years ago

The error is in the title: "`merge_call` called while defining a new graph or a tf.function." The complete message is: "RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`"

mrinal18 commented 3 years ago

Okay, got it. Can you share the code and the logs where you are seeing this issue, or code to reproduce it?

thanks

mrinal18 commented 3 years ago

Please also refer to this closed issue.

Zrufy commented 3 years ago
INFO:tensorflow:Error reported to Coordinator: in user code:

    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:613 train_step_fn  *
        loss = eager_train_step(
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:310 eager_train_step  *
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:99 apply_gradients  *
        self.update_average(self.iterations)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:124 update_average  *
        self._model_weights),))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2941 merge_call  **
        return self._merge_call(merge_fn, args, kwargs)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py:433 _merge_call
        "`merge_call` called while defining a new graph or a tf.function."

    RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`
Traceback (most recent call last):
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\training\coordinator.py", line 297, in stop_on_exception
    yield
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py", line 323, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 670, in wrapper
    raise e.ag_error_metadata.to_exception(e)
RuntimeError: in user code:

    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:613 train_step_fn  *
        loss = eager_train_step(
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:310 eager_train_step  *
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:99 apply_gradients  *
        self.update_average(self.iterations)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:124 update_average  *
        self._model_weights),))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2941 merge_call  **
        return self._merge_call(merge_fn, args, kwargs)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py:433 _merge_call
        "`merge_call` called while defining a new graph or a tf.function."

    RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`

I0504 13:43:43.337412 10136 coordinator.py:219] Error reported to Coordinator: in user code:

    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:613 train_step_fn  *
        loss = eager_train_step(
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:310 eager_train_step  *
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:99 apply_gradients  *
        self.update_average(self.iterations)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:124 update_average  *
        self._model_weights),))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2941 merge_call  **
        return self._merge_call(merge_fn, args, kwargs)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py:433 _merge_call
        "`merge_call` called while defining a new graph or a tf.function."

    RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`
Traceback (most recent call last):
  File "C:\Users\nvidiatesla\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\training\coordinator.py", line 297, in stop_on_exception
    yield
  File "C:\Users\nvidiatesla\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py", line 323, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "C:\Users\nvidiatesla\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 670, in wrapper
    raise e.ag_error_metadata.to_exception(e)
RuntimeError: in user code:

    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:613 train_step_fn  *
        loss = eager_train_step(
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:310 eager_train_step  *
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:99 apply_gradients  *
        self.update_average(self.iterations)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:124 update_average  *
        self._model_weights),))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2941 merge_call  **
        return self._merge_call(merge_fn, args, kwargs)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py:433 _merge_call
        "`merge_call` called while defining a new graph or a tf.function."

    RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 110, in main
    record_summaries=FLAGS.record_summaries)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py", line 664, in train_loop
    loss = _dist_train_step(train_input_iter)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\def_function.py", line 726, in _initialize
    *args, **kwds))
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\framework\func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\eager\def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\framework\func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
RuntimeError: in user code:

    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:648 _dist_train_step  *
        _sample_and_train(strategy, train_step_fn, data_iterator)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:630 _sample_and_train  *
        per_replica_losses = strategy.run(
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:613 train_step_fn  *
        loss = eager_train_step(
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:310 eager_train_step  *
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:99 apply_gradients  *
        self.update_average(self.iterations)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:124 update_average  *
        self._model_weights),))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2941 merge_call  **
        return self._merge_call(merge_fn, args, kwargs)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py:433 _merge_call
        "`merge_call` called while defining a new graph or a tf.function."

    RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`
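
Reading the "in user code" block above from the bottom up, the last frame that is not inside the tensorflow package is update_average in ema_optimizer.py, which is the call that reaches merge_call. A minimal sketch of pulling that frame out of a pasted log, assuming frames are printed as `path.py:LINE function` as they are here. The helper names are hypothetical and are not part of TensorFlow or the Object Detection API; the sample is a short excerpt of the log above.

```python
import re

# Matches AutoGraph "in user code" frames such as
#   C:\...\ema_optimizer.py:124 update_average  *
FRAME_RE = re.compile(
    r"^\s*(?P<path>.+\.py):(?P<line>\d+)\s+(?P<func>\w+)", re.MULTILINE
)


def user_code_frames(trace_text):
    """Return (file name, line, function, full path) for every frame found."""
    frames = []
    for match in FRAME_RE.finditer(trace_text):
        # Normalise Windows separators so the package check below is simple.
        path = match.group("path").replace("\\", "/")
        file_name = path.rsplit("/", 1)[-1]
        frames.append((file_name, int(match.group("line")), match.group("func"), path))
    return frames


def last_frame_outside(frames, package="/tensorflow/"):
    """Return the deepest frame whose path is not inside the given package."""
    candidates = [frame for frame in frames if package not in frame[3]]
    return candidates[-1] if candidates else None


# Excerpt copied from the log above (raw string keeps the backslashes intact).
SAMPLE_TRACE = r"""
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\object_detection\model_lib_v2.py:310 eager_train_step  *
        optimizer.apply_gradients(zip(gradients, trainable_variables))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\official\modeling\optimization\ema_optimizer.py:124 update_average  *
        self._model_weights),))
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2941 merge_call  **
        return self._merge_call(merge_fn, args, kwargs)
    C:\Users\user\anaconda3\envs\TSOBJ2\lib\site-packages\tensorflow\python\distribute\mirrored_run.py:433 _merge_call
"""

if __name__ == "__main__":
    culprit = last_frame_outside(user_code_frames(SAMPLE_TRACE))
    print(culprit[:3])
```

This only locates the frame; it does not explain why the moving-average update needs a cross-replica merge at that point.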
Zrufy commented 3 years ago

I read that issue, but the situation there is not similar to mine, so it does not resolve this.

mrinal18 commented 3 years ago

Two things that can be done (I tested them here and they seem to work):

  1. Remove the @tf.function decorator from _dist_train_step(data_iterator) and then run training again.
  2. Add synchronization=tf.VariableSynchronization.ON_READ to the definition of global_step.

Change 1 (line 641):

        @tf.function
        def _dist_train_step(data_iterator):

to

        def _dist_train_step(data_iterator):

Change 2 (line 554):

global_step = tf.Variable(
        0, trainable=False, dtype=tf.compat.v2.dtypes.int64, name='global_step',
        aggregation=tf.compat.v2.VariableAggregation.ONLY_FIRST_REPLICA)

to

global_step = tf.Variable(
        0, trainable=False, dtype=tf.compat.v2.dtypes.int64, name='global_step',
        aggregation=tf.compat.v2.VariableAggregation.ONLY_FIRST_REPLICA, synchronization=tf.VariableSynchronization.ON_READ)

Can you try these options and let me know?

Zrufy commented 3 years ago

I just tried the changes, but nothing changed: I keep getting the same error. I also set type: 'ssd_mobilenet_v1_keras' in the config for TensorFlow 2.x, but I still get the same error. I tried other MobileNet models as well and got the same error; with other networks, training goes well.

Zrufy commented 3 years ago

This error occurs for ssd_mobilenet and mobilenetv2.

Zrufy commented 3 years ago

@Mrinal18 any news about this type of error?

Zrufy commented 3 years ago

Using a config from the samples/config folder, I get this error; taking the config from the config/tf2 folder, there is no error. At this point I would like to understand whether it is possible to use that config and that type of model on 2.4.0.

b04505009 commented 3 years ago

I found that if I add use_moving_average: false to the optimizer, the problem disappears, but I didn't dig in further.
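
This workaround is consistent with the traceback earlier in the thread, which passes through ema_optimizer.py before reaching merge_call. As a rough sketch, the setting can be applied to a pipeline config as plain text. The assumptions here: the field belongs directly inside the optimizer { ... } block, next to the concrete optimizer, and, as far as I know, it is treated as true when omitted. The function name and the sample fragment are hypothetical and are not taken from the config used in this issue.

```python
import re


def disable_moving_average(config_text):
    """Return config_text with use_moving_average set to false."""
    flag = re.compile(r"use_moving_average\s*:\s*(true|false)")
    if flag.search(config_text):
        # The field is already present: just flip its value.
        return flag.sub("use_moving_average: false", config_text)

    # The field is absent: add it on the line after "optimizer {".
    # The pattern is anchored at line start so that names such as
    # "rms_prop_optimizer {" are not mistaken for the optimizer block.
    opener = re.compile(
        r"^(?P<indent>[ \t]*)optimizer\s*:?\s*\{[ \t]*$", re.MULTILINE
    )
    match = opener.search(config_text)
    if match is None:
        raise ValueError("no 'optimizer {' block found in config text")
    insertion = "\n" + match.group("indent") + "  use_moving_average: false"
    return config_text[: match.end()] + insertion + config_text[match.end():]


# Illustrative fragment only, not the real pets config.
SAMPLE_CONFIG = """train_config {
  optimizer {
    rms_prop_optimizer {
    }
  }
}
"""

if __name__ == "__main__":
    print(disable_moving_average(SAMPLE_CONFIG))
```

Editing the file by hand works just as well; the point is only where the line goes. Whether dropping the moving average is acceptable for a given model is a separate question that this thread does not settle.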

rcruzgar commented 3 years ago

Hi @Zrufy @Mrinal18 @b04505009 ,

Is there any news on how to solve this? I am getting the same error message for ssd_mobilenet_v2_keras, but not with ssd_efficientnet-b1_bifpn_keras, for example.

Cheers, R.