Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/sglvladi/models/blob/train_eval/research/object_detection/model_main_tf2.py
2. Describe the bug
I have been trying to modify research/object_detection/model_main_tf2.py so that it interleaves training and evaluation, similar to how research/object_detection/model_main.py did for TensorFlow 1.x. In doing so, I have had to make some changes to research/object_detection/model_lib_v2.py, where I have added a call to eager_eval_loop() whenever a new checkpoint becomes available. A comparison of all the changes made can be found here. When I run model_main_tf2.py with these changes, the following error is raised during evaluation:
Traceback (most recent call last):
File "model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "model_main_tf2.py", line 102, in main
model_lib_v2.train_loop(
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
eval_step_fn(latest_checkpoint)
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 641, in eval_step_fn
eager_eval_loop(detection_model,
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 870, in eager_eval_loop
loss_metrics[loss_key].update_state(loss_tensor)
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\utils\metrics_utils.py", line 90, in decorated
update_op = update_state_fn(*args, **kwargs)
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\metrics.py", line 355, in update_state
update_total_op = self.total.assign_add(value_sum)
File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\distribute\values.py", line 981, in assign_add
raise ValueError(
ValueError: SyncOnReadVariable does not support `assign_add` in cross-replica context when aggregation is set to `tf.VariableAggregation.SUM`.
After some digging, I found that changing the default distribution strategy at line 96 of model_main_tf2.py to OneDeviceStrategy (as shown here) gets rid of the error, and I am then able to interleave training and evaluation successfully.
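For context, the interleaving I added boils down to a simple checkpoint-watcher pattern: after each training interval, evaluate once if a new checkpoint has appeared. The following is an illustrative, framework-free sketch of that pattern only; latest_checkpoint and interleave are hypothetical names, not the actual model_lib_v2.py functions.

```python
import os
import time

def latest_checkpoint(model_dir):
    """Return the most recently written checkpoint file in model_dir, or None."""
    ckpts = [os.path.join(model_dir, f)
             for f in os.listdir(model_dir)
             if f.startswith("ckpt")]
    return max(ckpts, key=os.path.getmtime) if ckpts else None

def interleave(train_one_interval, evaluate, model_dir, num_intervals):
    """Alternate training and evaluation: after each training interval,
    run one evaluation pass if a new checkpoint has appeared."""
    last_seen = None
    for _ in range(num_intervals):
        train_one_interval()  # expected to write a checkpoint into model_dir
        ckpt = latest_checkpoint(model_dir)
        if ckpt is not None and ckpt != last_seen:
            evaluate(ckpt)    # stands in for the call to eager_eval_loop(...)
            last_seen = ckpt
```

In the real code the evaluation runs under the same distribution strategy as training, which is where the SyncOnReadVariable error above is triggered.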
3. Steps to reproduce
Pull this version of the models and try to run model_main_tf2.py on any model with properly configured evaluation. For reference, I am trying to train SSD MobileNet V1 FPN 640x640 with the default config file, having set the appropriate paths and changed the following settings:
The command used to train (and evaluate) the model is the following:
where the assumption is made that checkpoint_dir is the same as model_dir.
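The exact settings and command were attached to the original report and are not reproduced here. For orientation, a typical invocation looks like the following; the flag names are those accepted by model_main_tf2.py, but the paths are placeholders, and checkpoint_dir is set equal to model_dir as described above.

```shell
# Placeholder paths; adjust to your own setup.
MODEL_DIR="models/my_ssd_mobilenet_v1_fpn"
PIPELINE_CONFIG_PATH="${MODEL_DIR}/pipeline.config"
python model_main_tf2.py \
  --pipeline_config_path="${PIPELINE_CONFIG_PATH}" \
  --model_dir="${MODEL_DIR}" \
  --checkpoint_dir="${MODEL_DIR}"
```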
4. Expected behavior
Upon execution of model_main_tf2.py, the process should train the model and execute a single evaluation loop every time a new checkpoint is created. Note that this works fine if the distribution strategy is set to OneDeviceStrategy.
5. Additional context
None
6. System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
Mobile device name if the issue happens on a mobile device: N/A
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 2.2
Python version: 3.8
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A