hclhkbu / dlbench

Benchmarking State-of-the-Art Deep Learning Software Tools
http://dlbench.comp.hkbu.edu.hk/
MIT License

FCN5 update for single and multi-GPU performance #17

Closed tfboyd closed 7 years ago

tfboyd commented 7 years ago

I clocked the AWS K80s to the same 562 MHz with boost off, which I believe matches your test setup. K80s can vary based on peering and setup, but even though your K80s show as not peered, I suspect they still work fine when treated as if they were; I cannot know for sure.
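For reference, pinning a K80 to these clocks can be done with `nvidia-smi`. This is a sketch, not part of the benchmark scripts; the 2505 MHz memory clock is the one quoted later in this thread, and root privileges are assumed.

```shell
# Disable auto boost so the SM clock stays fixed (requires root).
sudo nvidia-smi --auto-boost-default=0

# Pin application clocks: memory 2505 MHz, SM (graphics) 562 MHz.
sudo nvidia-smi -ac 2505,562

# Verify the applied clocks.
nvidia-smi --query-gpu=clocks.applications.memory,clocks.applications.graphics --format=csv
```

Locking clocks this way removes thermal/boost variance between runs, which is what makes cross-machine comparisons like this one meaningful.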

I do not expect you to run TF 1.2, given that it is an RC and you have already started the v8 tests. I would appreciate it if this code were used (after you approve it) for the v8 runs, using the features available in TF 1.1. You could even use XLA, which is a boost for single GPU with TF 1.1, but it has to be compiled in by choosing 'Y' during ./configure. Even without XLA, this provides users a better resource. I will likely publish select numbers on our website with the TF 1.2 release to highlight the improvements.

Finally, your numbers may vary from mine due to slightly different versions of TensorFlow, different compile options, and obviously the hardware. That is completely understandable and expected. This is also a simple network that I doubt anyone runs in practice, but it is fun to tweak, and it shows how important the input pipeline is to getting the best possible performance.

Good luck with v8.

P.S. I might have some more tweaks for AlexNet and ResNet, but they are at least OK as-is. NCHW is a big deal, but I can always post results on our main website with the updated code.
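To illustrate why NCHW matters: cuDNN's convolution kernels prefer channels-first (NCHW) layout, while TensorFlow's default is channels-last (NHWC), so the conversion is just an axis transpose. A minimal pure-Python sketch of the index mapping, with an illustrative 1x2x2x3 tensor (not taken from the benchmark code):

```python
def nhwc_to_nchw(t):
    """Transpose a nested-list tensor from NHWC to NCHW layout."""
    n = len(t)            # batch
    h = len(t[0])         # height
    w = len(t[0][0])      # width
    c = len(t[0][0][0])   # channels
    return [[[[t[b][y][x][ch] for x in range(w)] for y in range(h)]
             for ch in range(c)] for b in range(n)]

# A tiny NHWC tensor: batch=1, 2x2 spatial, 3 channels.
nhwc = [[[[0, 1, 2], [3, 4, 5]],
         [[6, 7, 8], [9, 10, 11]]]]
nchw = nhwc_to_nchw(nhwc)
print(nchw[0][0])  # channel-0 plane: [[0, 3], [6, 9]]
```

In TF 1.x the same effect comes from building conv/pool ops with `data_format='NCHW'`, which avoids any transpose at all on the hot path.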

shyhuai commented 7 years ago

Hi @tfboyd, many thanks for your contribution. The core frequency of the K80 GPU is kept at the default core clock of 562 MHz, and auto boost is off. The two K80s on the test machine are not peered, which may reduce performance if data synchronization across GPUs is slow in multi-GPU runs. We will first release the results of your code that works properly on tf1.0 (4ac9c09), since our v8 version will not update to tf1.2. Thank you!

FreemanX commented 7 years ago

Thank you! We tested on our K80 with the clocks set to 562 MHz for SM and 2505 MHz for memory. With use_datasets=False, xla=False, and batch_size=1024, we get around 60 ms/batch.
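For context, the quoted step time translates directly into throughput; a quick sanity-check calculation using only the numbers reported above:

```python
batch_size = 1024    # images per batch, as reported above
ms_per_batch = 60.0  # measured step time in milliseconds

# Throughput = batch size / step time in seconds.
images_per_sec = batch_size / (ms_per_batch / 1000.0)
print(f"{images_per_sec:.0f} images/sec")  # 17067 images/sec
```

Comparing images/sec rather than ms/batch makes runs with different batch sizes directly comparable.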

tfboyd commented 7 years ago

60 ms is not far from my result, which is good and confirms that my setup is close to yours. I have AlexNet changes that move to NCHW, which I will submit tomorrow.

tfboyd commented 7 years ago

@FreemanX

I set the defaults to work with TF 1.0 or 1.1, so you do not need to change any of your scripts. I was not remotely asking you to run on 1.2. I am doing the same with my other pull requests.