tensorflow / benchmarks

A benchmark framework for Tensorflow

Benchmark performance drops significantly when using map_and_batch #137

Open eladweiss opened 6 years ago

eladweiss commented 6 years ago

After updating to the latest benchmarks code, we noticed a drop in performance on the inception3 and resnet152 models. Testing with TensorFlow r1.5 on 32x P100 GPUs (8 servers), ImageNet data, batch size 64.

Inception3: (throughput graph)

Resnet152: (throughput graph)

We isolated the 'problematic' change to: https://github.com/tensorflow/benchmarks/commit/82dd0539c76afa8491e50d8f796e686b4d97b988#diff-3269d1838b2ebc9c6c071802fb946ca1R521

After replacing the specific call to map_and_batch() with the previous call to map() with 16 parallel calls (https://github.com/Mellanox/benchmarks/commit/56e0b2298f835905f7d8a53c5bf482ed1dce55fd), we get high numbers again. We don't have a theory to explain this.
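For reference, here is a minimal sketch of the two input-pipeline variants being compared (this is not the actual tf_cnn_benchmarks code; parse_fn, the filenames, and the num_parallel_batches value are placeholders, and the import path follows the TF r1.5-era contrib API):

    import tensorflow as tf
    from tensorflow.contrib.data.python.ops import batching

    def build_dataset(filenames, parse_fn, batch_size, use_map_and_batch=True):
        dataset = tf.data.TFRecordDataset(filenames)
        if use_map_and_batch:
            # Fused path introduced by the commit linked above.
            dataset = dataset.apply(
                batching.map_and_batch(map_func=parse_fn,
                                       batch_size=batch_size,
                                       num_parallel_batches=4))  # placeholder value
        else:
            # Previous path: parallel map with 16 parallel calls, then a plain
            # batch; switching back to this restored the reported throughput.
            dataset = dataset.map(parse_fn, num_parallel_calls=16)
            dataset = dataset.batch(batch_size)
        return dataset.prefetch(1)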

Thanks

tfboyd commented 6 years ago

I will have someone take a look; it might also impact other tests. I finally got nightly tests up (within the last week or so), but I do not have anything distributed, only multi-GPU on DGX-1s.

Thank you for linking to the change in question.

eladweiss commented 6 years ago

@yanivbl @shamoya @shimonran @tfboyd Thanks! I guess we can assist in distributed testing if a patch is available. (I'll need to schedule this with my supervisors, as our lab is currently very busy).

asispatra commented 6 years ago

Hi @tfboyd ,

I have faced a similar issue. Moving from benchmarks git commit f5d85ae to 82dd053, there is a significant performance drop caused by the following change. I am using 4 GPUs and a batch size of 64 per GPU.

https://github.com/tensorflow/benchmarks/compare/f5d85ae...82dd053#diff-3269d1838b2ebc9c6c071802fb946ca1R522

Looking at an NVIDIA profiler trace, the effect shows up as data transfer from CPU to GPU taking longer for the same amount of data.

This performance issue is present in the master branch too. To get the performance back, do I need to go back to commit f5d85ae, or is there a plan to fix this in master?

Thanks.

tfboyd commented 6 years ago

Sorry, I got distracted. @reedwm, can you take a look at the diff? We still do not have an OSS distributed test to verify this externally, but if the change does not impact the multi-GPU (single-node) case, then maybe we can do a rollback. There was an offer to test a patch for us if we can provide a PR or branch.

anpark commented 6 years ago

Same problem here. If I use batching.map_and_batch, it is much slower than batching first and then mapping. Example code (x: serialized_example, y: index in current batch):

    if use_map_and_batch:
      dataset = dataset.apply(
          batching.map_and_batch(map_func=lambda x, y: parse_fn(x, batch_pos=y),
              batch_size=batch_size_per_split, num_parallel_batches=self.num_splits))
    else:
      dataset = dataset.batch(self.batch_size)
      dataset = dataset.map(lambda x: parse_fn(x), num_parallel_calls=self.num_data_mapper)

One reason I found: parse_example(...) is much faster than parse_single_example(...).
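To illustrate that last point, here is a rough sketch of the two parsing strategies (assuming the records are serialized tf.Example protos with a single string feature; the feature name here is made up for the example):

    import tensorflow as tf

    features = {'image/encoded': tf.FixedLenFeature([], tf.string)}

    def parse_per_record(dataset, batch_size):
        # parse_single_example runs once per record, before batching.
        dataset = dataset.map(
            lambda serialized: tf.parse_single_example(serialized, features))
        return dataset.batch(batch_size)

    def parse_per_batch(dataset, batch_size):
        # Batch the raw serialized strings first, then parse the whole batch
        # with one vectorized parse_example call.
        dataset = dataset.batch(batch_size)
        return dataset.map(
            lambda serialized: tf.parse_example(serialized, features))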

eladweiss commented 6 years ago

I have a certain theory, not sure if it's correct.

Looking at the worker CPU utilization graphs, it is possible that the increased parallelism of MapAndBatch(), while making the pre-processing finish faster, actually steals resources from the worker's CPU processing thread (because the pre-processing is now utilizing all of the cores).

If I am correct, then the peak of CPU utilization at the start of the graph is the preprocessing, and the trailing tail is the worker's CPU processing.

From what I see, the CPU processing is in fact what triggers the end of the step, and the preprocessing is far from being a bottleneck. I also note that the processing thread does not require a lot of CPU most of the time, but it may require a little more at the start of a step.

(worker CPU utilization graph)

asispatra commented 6 years ago

@reedwm, is there any progress on resolving this issue?

reedwm commented 6 years ago

Not yet, but I hope to look at it soon.

@eladweiss, thank you for your analysis! In benchmark_cnn.py, we set the env var TF_GPU_THREAD_MODE to gpu_private, which gives each GPU two dedicated threads. We do this because we observed exactly what you described: preprocessing threads steal resources from the non-preprocessor threads, delaying the scheduling of GPU kernels, and hence delaying the GPU from doing work.
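For anyone who wants to apply the same setting outside of benchmark_cnn.py, a rough sketch follows (the variables must be set before TensorFlow initializes the GPU; the thread count of 2 mirrors the two dedicated threads mentioned above):

    import os

    # Give each GPU private kernel-launch threads so preprocessing threads
    # cannot starve them; must run before the TF runtime touches the GPU.
    os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
    os.environ['TF_GPU_THREAD_COUNT'] = '2'

    import tensorflow as tf
    # ...build the graph and create the session after setting the variables...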

From your analysis, it seems the issue is probably still occurring. Perhaps now, preprocessing threads are stealing resources from CPU ops instead of GPU ops. I will try to look into this soon.