hclhkbu / dlbench

Benchmarking State-of-the-Art Deep Learning Software Tools
http://dlbench.comp.hkbu.edu.hk/
MIT License

FCN5 update for single and multi-GPU performance #17

Closed tfboyd closed 7 years ago

tfboyd commented 7 years ago

I clocked the AWS K80s to the same 562 MHz with boost off, which I believe matches your test setup. K80s can vary based on peering and setup, but even though your K80s show as not peered, I suspect they still work fine when treated as if they were; I cannot know for sure.
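For reference, pinning a K80 to these clocks can be done with `nvidia-smi`. This is a sketch, not part of the benchmark scripts; the 2505 MHz memory clock is the one quoted later in this thread, and root privileges are assumed.

```shell
# Disable auto boost so the SM clock stays fixed (requires root).
sudo nvidia-smi --auto-boost-default=0

# Pin application clocks: memory 2505 MHz, SM (graphics) 562 MHz.
sudo nvidia-smi -ac 2505,562

# Verify the applied clocks.
nvidia-smi --query-gpu=clocks.applications.memory,clocks.applications.graphics --format=csv
```

Locking clocks this way removes thermal/boost variance between runs, which is what makes cross-machine comparisons like this one meaningful.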

I do not expect you to run TF 1.2, given that it is an RC and you have already started the v8 tests. I would appreciate it if this code were used (after you approve it) for the v8 runs, using the features available in TF 1.1. You could even use XLA, which is a boost for single GPU with TF 1.1, but it has to be compiled in by choosing 'Y' during ./configure. Even without XLA, this provides users a better resource. I will likely publish select numbers on our website with the TF 1.2 release to highlight the improvements.

Finally, your numbers may vary from mine due to slightly different versions of TensorFlow, different compile options, and obviously the hardware. That is completely understandable and expected. This is also a simple network that I doubt anyone runs in practice, but it is fun to tweak, and it shows how important the input pipeline is to getting the best possible performance.

Good luck with v8.

P.S. I might have some more tweaks for AlexNet and ResNet, but they are at least OK as-is. NCHW is a big deal, but I can always post results on our main website with the updated code.
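To illustrate why NCHW matters: cuDNN's convolution kernels prefer channels-first (NCHW) layout, while TensorFlow's default is channels-last (NHWC), so the conversion is just an axis transpose. A minimal pure-Python sketch of the index mapping, with an illustrative 1x2x2x3 tensor (not taken from the benchmark code):

```python
def nhwc_to_nchw(t):
    """Transpose a nested-list tensor from NHWC to NCHW layout."""
    n = len(t)            # batch
    h = len(t[0])         # height
    w = len(t[0][0])      # width
    c = len(t[0][0][0])   # channels
    return [[[[t[b][y][x][ch] for x in range(w)] for y in range(h)]
             for ch in range(c)] for b in range(n)]

# A tiny NHWC tensor: batch=1, 2x2 spatial, 3 channels.
nhwc = [[[[0, 1, 2], [3, 4, 5]],
         [[6, 7, 8], [9, 10, 11]]]]
nchw = nhwc_to_nchw(nhwc)
print(nchw[0][0])  # channel-0 plane: [[0, 3], [6, 9]]
```

In TF 1.x the same effect comes from building conv/pool ops with `data_format='NCHW'`, which avoids any transpose at all on the hot path.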

shyhuai commented 7 years ago

Hi @tfboyd, many thanks for your contribution. The core frequency of the K80 GPU is kept at the default core clock of 562 MHz, and auto boost is off. The two K80s on the test machine are not peered, which may reduce performance if data synchronization across GPUs is slow in multi-GPU runs. We will first release the results of your code that works properly on tf1.0 (4ac9c09), since our v8 version will not update to tf1.2. Thank you!

FreemanX commented 7 years ago

Thank you! We tested on our K80 with the clocks set to 562 MHz for SM and 2505 MHz for memory. With use_datasets=False, xla=False, and batch_size=1024, we get around 60 ms/batch.
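For context, the quoted step time translates directly into throughput; a quick sanity-check calculation using only the numbers reported above:

```python
batch_size = 1024    # images per batch, as reported above
ms_per_batch = 60.0  # measured step time in milliseconds

# Throughput = batch size / step time in seconds.
images_per_sec = batch_size / (ms_per_batch / 1000.0)
print(f"{images_per_sec:.0f} images/sec")  # 17067 images/sec
```

Comparing images/sec rather than ms/batch makes runs with different batch sizes directly comparable.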

tfboyd commented 7 years ago

60 ms is not far from my result, which is good and confirms that my setup is close to yours. I have AlexNet changes that move to NCHW, which I will submit tomorrow.

tfboyd commented 7 years ago

@FreemanX

I set the defaults to work with TF 1.0 or 1.1, so you do not need to change any of your scripts. I was not remotely asking you to run on 1.2. I am doing the same with my other pull requests.