Prerequisites

Please answer the following questions for yourself before submitting an issue.

[x] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
[x] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[x] I checked to make sure that this issue has not already been filed.

1. The entire URL of the file you are using

https://github.com/sglvladi/models/blob/train_eval/research/object_detection/model_main_tf2.py

2. Describe the bug

I have been trying to modify research/object_detection/model_main_tf2.py so that it interleaves training and evaluation, similar to what research/object_detection/model_main.py did for TensorFlow 1.x. To do so, I had to make some changes to research/object_detection/model_lib_v2.py, where I added a call to eager_eval_loop() whenever a new checkpoint becomes available. A comparison of all the changes made can be found here. When I run model_main_tf2.py with these changes, I get the following error when evaluation runs:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 102, in main
    model_lib_v2.train_loop(
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
    eval_step_fn(latest_checkpoint)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 641, in eval_step_fn
    eager_eval_loop(detection_model,
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 870, in eager_eval_loop
    loss_metrics[loss_key].update_state(loss_tensor)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\utils\metrics_utils.py", line 90, in decorated
    update_op = update_state_fn(*args, **kwargs)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\metrics.py", line 355, in update_state
    update_total_op = self.total.assign_add(value_sum)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\distribute\values.py", line 981, in assign_add
    raise ValueError(
ValueError: SyncOnReadVariable does not support `assign_add` in cross-replica context when aggregation is set to `tf.VariableAggregation.SUM`.

After doing some digging, I found that changing the default distribution strategy at line 96 of model_main_tf2.py to OneDeviceStrategy (as shown here) gets rid of the error, and I am then able to interleave training and evaluation successfully.
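For anyone who wants to try the same workaround, here is a minimal sketch of the strategy switch. It is not the exact code from my branch: the helper name make_strategy, the use_single_device argument, and the "/gpu:0" device string are assumptions made for illustration. Only tf.distribute.OneDeviceStrategy and tf.distribute.MirroredStrategy are real TensorFlow APIs.

```python
def make_strategy(use_single_device: bool, device: str = "/gpu:0"):
    """Return the distribution strategy to build the training loop under.

    Hypothetical helper: `use_single_device` and the default device string
    are assumptions for this sketch, not flags of model_main_tf2.py.
    """
    # Imported lazily so the helper can be defined without loading TensorFlow.
    import tensorflow as tf

    if use_single_device:
        # Workaround: pin everything to one device, which avoids the
        # SyncOnReadVariable assign_add error during evaluation.
        return tf.distribute.OneDeviceStrategy(device=device)
    # Default used by model_main_tf2.py when no TPU is requested.
    return tf.distribute.MirroredStrategy()
```

The returned object would then be used in place of the strategy created in model_main_tf2.py, for example by entering `make_strategy(True).scope()` before calling the training loop.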

3. Steps to reproduce

Pull this version of the models repository and run model_main_tf2.py on any model with properly configured evaluation. For reference, I am trying to train SSD MobileNet V1 FPN 640x640 with the default config file, having set the appropriate paths and changed the following settings:

model {
  ssd {
    num_classes: 1
    ...
  }
  ...
}

train_config: {
  ...
  fine_tune_checkpoint_type: "detection"
  batch_size: 8
  use_bfloat16: false
  ...
}

The command used to train (and evaluate) the model is the following:

python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

where the assumption is made that checkpoint_dir is the same as model_dir.

4. Expected behavior

Upon execution of model_main_tf2.py, the process should train the model and run a single evaluation loop every time a new checkpoint is created. Note that this works fine if the distribution strategy is set to OneDeviceStrategy.
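To make the expected control flow concrete, here is a framework-free sketch of the interleaving. All names (train_and_evaluate, train_step_fn, save_checkpoint_fn, eval_fn, checkpoint_every_n) are hypothetical and do not match the real signatures in model_lib_v2.py; the sketch only shows the order of operations.

```python
from typing import Callable, List


def train_and_evaluate(
    num_steps: int,
    checkpoint_every_n: int,
    train_step_fn: Callable[[int], None],
    save_checkpoint_fn: Callable[[int], str],
    eval_fn: Callable[[str], None],
) -> List[str]:
    """Run training and evaluate once for every checkpoint that is written.

    Returns the list of checkpoints that were evaluated, in order.
    """
    evaluated: List[str] = []
    for step in range(1, num_steps + 1):
        train_step_fn(step)
        if step % checkpoint_every_n == 0:
            # A new checkpoint is created here ...
            checkpoint = save_checkpoint_fn(step)
            # ... and exactly one evaluation pass runs on it before
            # training resumes.
            eval_fn(checkpoint)
            evaluated.append(checkpoint)
    return evaluated
```

In the modified model_lib_v2.py, the role of eval_fn is played by the call to eager_eval_loop() on the latest checkpoint.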

5. Additional context

None

6. System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
Mobile device name if the issue happens on a mobile device: N/A
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.2
Python version: 3.8
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: CUDA 10.1 / cuDNN 7.5.6
GPU model and memory: NVIDIA GTX 1070 Ti

Issue with MirroredStrategy() when trying to interleave training and evaluation for TensorFlow 2 object detection models #8876