tensorflow / benchmarks

A benchmark framework for Tensorflow

Fix wrong accuracy fetches when single_session is used. #469

Closed · gyeongchan-yun closed this 4 years ago

gyeongchan-yun commented 4 years ago

Hello,

I'm wondering whether the accuracy op is correct when 'distributed_all_reduce' is used. To support my doubt, here is a log from a run where the top-5 accuracy goes over 1.0!

I ran it on 4 servers with 4 GPUs each (16 GPUs in total), with the following configuration:

Model: resnet56
Dataset: cifar10 (synthetic)
Mode: BenchmarkMode.TRAIN
SingleSess: True
Variables: distributed_all_reduce
AllReduce: xring

Step Img/sec                                    total_loss top_1_accuracy top_5_accuracy
1    images/sec: 5.6 +/- 0.0 (jitter = 0.0)     3.286      0.445          1.906
10   images/sec: 52.6 +/- 81.8 (jitter = 29.5)  2.580      0.484          1.945
20   images/sec: 99.3 +/- 42.3 (jitter = 29.1)  2.522      0.500          2.164
30   images/sec: 141.0 +/- 28.7 (jitter = 19.9) 2.519      0.398          1.852
40   images/sec: 178.4 +/- 21.7 (jitter = 17.6) 2.511      0.391          1.992
50   images/sec: 212.3 +/- 17.7 (jitter = 24.9) 2.507      0.375          2.047
60   images/sec: 242.8 +/- 14.8 (jitter = 25.6) 2.509      0.312          1.891
70   images/sec: 270.8 +/- 12.7 (jitter = 27.8) 2.503      0.391          2.094
80   images/sec: 297.0 +/- 11.4 (jitter = 29.8) 2.502      0.414          2.008
90   images/sec: 321.3 +/- 10.4 (jitter = 34.8) 2.503      0.391          1.914
100  images/sec: 343.0 +/- 9.5 (jitter = 38.9)  2.505      0.391          1.883
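
To make the scale of the error concrete, here is a rough back-of-the-envelope sketch (plain Python, not code from benchmark_cnn.py; only the worker count and the 10 CIFAR-10 classes come from the run above, the per-worker batch size is made up):

# Back-of-the-envelope sketch (plain Python, not benchmark_cnn.py code).
num_workers = 4          # 4 servers, single session
per_worker_batch = 256   # hypothetical per-worker batch size
random_top5 = 0.5        # ~5 out of 10 CIFAR-10 classes for random predictions

# Each tower's accuracy op is a count of correct predictions, and in
# single-session mode the fetch sums the counts from every worker's towers.
total_correct = random_top5 * per_worker_batch * num_workers

print(total_correct / per_worker_batch)                  # 2.0 -> matches the log
print(total_correct / (per_worker_batch * num_workers))  # 0.5 -> expected value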

In benchmark_cnn.py,

  def _build_fetches(self, global_step, all_logits, losses, device_grads,
                     enqueue_ops, update_ops, all_accuracy_ops, phase_train):
    """Complete construction of model graph, populating the fetches map."""
    fetches = {}
    if enqueue_ops:
      fetches['enqueue_ops'] = enqueue_ops
    for name, ops in all_accuracy_ops.items():
      if name.startswith(constants.UNREDUCED_ACCURACY_OP_PREFIX):
        key = name[len(constants.UNREDUCED_ACCURACY_OP_PREFIX):]
        fetches[key] = tf.concat(ops, 0)
      else:
        # This divisor is the problem when single_session is used.
        fetches[name] = tf.reduce_sum(ops) / self.batch_size
        if self.task_index == 0 and self.params.summary_verbosity >= 1:
          tf.summary.scalar(name, fetches[name])

I think the following would be the correct accuracy fetch when single_session is used:

fetches[name] = (tf.reduce_sum(ops) / (self.batch_size *
                 (self.num_workers if self.single_session else 1)))
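
If I understand the code correctly, in single-session (distributed_all_reduce) mode the fetched accuracy ops cover the towers of every worker, so the summed count has to be normalized by the effective global batch size (self.batch_size times self.num_workers); in every other mode the extra factor is 1 and the behavior is unchanged.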

Best regards, Gyeongchan Yun

googlebot commented 4 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



gyeongchan-yun commented 4 years ago

@googlebot I signed it!

googlebot commented 4 years ago

CLAs look good, thanks!


gyeongchan-yun commented 4 years ago

> You are failing some tests, like testReplicated. To reproduce, run:
>
>     python -m unittest -v benchmark_cnn_test.TfCnnBenchmarksTest.testReplicated
>
> The issue is that self.single_session is not set during evaluation. Probably the best way to fix this would be to set self.single_session in the constructor:
>
>     self.single_session = params.variable_update == 'distributed_all_reduce'
>
> Then, change the if-statement here to check whether self.single_session is True instead, and no longer set self.single_session in the _build_graph function.

Thanks for your kind suggestion! I've updated it and confirmed that the test passes.
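
Roughly, the change looks like this (a minimal standalone sketch, not the exact diff; BenchmarkCNNSketch and Params are stand-ins for the real benchmark_cnn.py classes):

# Minimal standalone sketch of the suggested change (a hypothetical
# simplification, not the real benchmark_cnn.py classes or the exact diff).
from collections import namedtuple

Params = namedtuple('Params', ['variable_update'])

class BenchmarkCNNSketch(object):
  def __init__(self, params):
    self.params = params
    # Set once in the constructor so the flag also exists during evaluation,
    # instead of being assigned inside _build_graph().
    self.single_session = params.variable_update == 'distributed_all_reduce'

  def _build_graph(self):
    # Branch on the flag; _build_graph() no longer assigns it.
    if self.single_session:
      return 'single-session graph (all workers in one graph)'
    return 'per-worker graph'

print(BenchmarkCNNSketch(Params('distributed_all_reduce'))._build_graph())
print(BenchmarkCNNSketch(Params('replicated'))._build_graph())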

Best Regards, Gyeongchan Yun