tensorflow / models

Models and examples built with TensorFlow

Issue with MirroredStrategy() when trying to interleave training and evaluation for TensorFlow 2 object detection models #8876

Open sglvladi opened 4 years ago

sglvladi commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/sglvladi/models/blob/train_eval/research/object_detection/model_main_tf2.py

2. Describe the bug

I have been trying to modify research/object_detection/model_main_tf2.py so that it interleaves training and evaluation, as research/object_detection/model_main.py did for TensorFlow 1.x. To do so, I changed research/object_detection/model_lib_v2.py to call eager_eval_loop() whenever a new checkpoint becomes available. A comparison of all the changes made can be found here. When I run model_main_tf2.py with these changes, I get the following error once evaluation starts:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 102, in main
    model_lib_v2.train_loop(
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 685, in train_loop
    eval_step_fn(latest_checkpoint)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 641, in eval_step_fn
    eager_eval_loop(detection_model,
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\object_detection\model_lib_v2.py", line 870, in eager_eval_loop
    loss_metrics[loss_key].update_state(loss_tensor)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\utils\metrics_utils.py", line 90, in decorated
    update_op = update_state_fn(*args, **kwargs)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\keras\metrics.py", line 355, in update_state
    update_total_op = self.total.assign_add(value_sum)
  File "C:\Users\sglvladi\Anaconda3\envs\tf2\lib\site-packages\tensorflow\python\distribute\values.py", line 981, in assign_add
    raise ValueError(
ValueError: SyncOnReadVariable does not support `assign_add` in cross-replica context when aggregation is set to `tf.VariableAggregation.SUM`.
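
One plausible reading of the final error: the Keras metric's `total` is a sync-on-read variable with SUM aggregation, so each replica keeps its own component and a read adds them up. An `assign_add` issued outside any replica has no single component to target. The toy class below is only an illustration of that idea in plain Python; `ToySyncOnReadVariable` is a hypothetical name and is not TensorFlow's implementation.

```python
class ToySyncOnReadVariable:
    """Toy stand-in for a sync-on-read variable (NOT TensorFlow's class).

    Each replica owns one local component; a cross-replica read
    aggregates the components using `aggregation` ("SUM" or "MEAN").
    """

    def __init__(self, num_replicas, aggregation="SUM"):
        self.components = [0.0] * num_replicas
        self.aggregation = aggregation

    def read(self):
        total = sum(self.components)
        if self.aggregation == "SUM":
            return total
        return total / len(self.components)  # "MEAN"

    def assign_add(self, value, replica_id=None):
        if replica_id is not None:
            # Replica context: update only this replica's component.
            self.components[replica_id] += value
            return
        # Cross-replica context: no single component to target.
        if self.aggregation == "SUM":
            # Adding `value` to every component would inflate the summed
            # read by num_replicas * value, so the toy refuses, as the
            # traceback above does.
            raise ValueError(
                "assign_add is ambiguous in cross-replica context "
                "when aggregation is SUM"
            )
        # With MEAN, adding to every component shifts the read by `value`.
        for i in range(len(self.components)):
            self.components[i] += value
```

Under this reading, the metric update would need to happen per replica (in TensorFlow, inside a function passed to `strategy.run`) rather than directly from the evaluation loop.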

After some digging, I found that changing the default distribution strategy at line 96 of model_main_tf2.py to OneDeviceStrategy (as shown here) gets rid of the error, and I can then interleave training and evaluation successfully.
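
The workaround amounts to choosing the strategy before entering `strategy.scope()`. A minimal sketch, assuming a hypothetical `interleave_eval` switch and hypothetical helper names (`choose_strategy_name`, `build_strategy`) that are not part of model_main_tf2.py; only `tf.distribute.MirroredStrategy` and `tf.distribute.OneDeviceStrategy` are actual TensorFlow APIs here:

```python
def choose_strategy_name(interleave_eval: bool, num_gpus: int) -> str:
    """Pick a strategy name (hypothetical helper, not part of the repo).

    Interleaved evaluation updates Keras metrics outside a replica
    context, so fall back to a single device in that case.
    """
    if interleave_eval or num_gpus <= 1:
        return "one_device"
    return "mirrored"


def build_strategy(name: str, num_gpus: int = 0):
    """Create the tf.distribute strategy matching `name`."""
    # Imported lazily so choose_strategy_name stays usable without TensorFlow.
    import tensorflow as tf

    if name == "mirrored":
        return tf.distribute.MirroredStrategy()
    # Assumption: the first GPU is the target when one is available.
    device = "/gpu:0" if num_gpus > 0 else "/cpu:0"
    return tf.distribute.OneDeviceStrategy(device=device)
```

The returned object would stand in for the default strategy before `with strategy.scope():`. This gives up multi-GPU training in exchange for working evaluation, so it is a stopgap rather than a fix.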

3. Steps to reproduce

Pull this version of the models repository and run model_main_tf2.py on any model with properly configured evaluation. For reference, I am training SSD MobileNet V1 FPN 640x640 with the default config file, with the appropriate paths set and the following settings changed:

model {
  ssd {
    num_classes: 1
    ...
  }
  ...
}

train_config: {
  ...
  fine_tune_checkpoint_type: "detection"
  batch_size: 8
  use_bfloat16: false
  ...
}

The command used to train (and evaluate) the model is the following:

python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

where checkpoint_dir is assumed to be the same as model_dir.

4. Expected behavior

Upon execution, model_main_tf2.py should train the model and run a single evaluation loop every time a new checkpoint is created. Note that this works fine if the distribution strategy is set to OneDeviceStrategy.
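
The "one evaluation per new checkpoint" behavior can be sketched independently of TensorFlow. The helper below is hypothetical (`make_checkpoint_evaluator` is not a function in model_lib_v2.py); it assumes the caller supplies a function that returns the newest checkpoint path and a function that evaluates a given checkpoint:

```python
from typing import Callable, List, Optional


def make_checkpoint_evaluator(
    latest_checkpoint_fn: Callable[[], Optional[str]],
    eval_fn: Callable[[str], None],
) -> Callable[[], bool]:
    """Return a callable that evaluates each new checkpoint exactly once."""
    last_evaluated: List[Optional[str]] = [None]

    def maybe_evaluate() -> bool:
        ckpt = latest_checkpoint_fn()
        # Skip when nothing has been saved yet or the checkpoint is unchanged.
        if ckpt is None or ckpt == last_evaluated[0]:
            return False
        eval_fn(ckpt)
        last_evaluated[0] = ckpt
        return True

    return maybe_evaluate
```

In the training loop, `latest_checkpoint_fn` could be `lambda: tf.train.latest_checkpoint(model_dir)`, and `eval_fn` could wrap the call to eager_eval_loop(); `maybe_evaluate()` would then be called after each checkpoint save.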

5. Additional context

None

6. System information

turowicz commented 4 years ago

Can we get an ETA on this? This is very important for QoL aspects of using TF.

I can see why people are fleeing to PyTorch.

cc @tombstone @ravikyram