hclhkbu / dlbench

Benchmarking State-of-the-Art Deep Learning Software Tools
http://dlbench.comp.hkbu.edu.hk/
MIT License

Alexnet NCHW, split variables across GPUs and DataSet support #18

Closed tfboyd closed 7 years ago

tfboyd commented 7 years ago

Changed AlexNet to use NCHW and also added dataset support. The losses in the logs look similar to what they were before, so I feel reasonably confident that everything is still copacetic.

I did not test on TF 1.0 (released in February), but I did check that it works on TF 1.1. I moved to tf.layers, but that was part of the TF 1.0 API, so I think it is fine.
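For illustration, the NCHW change via tf.layers looks roughly like this (a minimal sketch against the TF 1.x-era API, not the exact code from the diff; the layer sizes here are placeholders):

```python
import tensorflow as tf  # TF 1.x-era API

# Input in NCHW layout: [batch, channels, height, width].
images = tf.placeholder(tf.float32, [128, 3, 224, 224])

# tf.layers expresses NCHW as data_format='channels_first'; with cuDNN on
# NVIDIA GPUs this layout is generally faster than the NHWC default.
conv1 = tf.layers.conv2d(
    inputs=images,
    filters=64,
    kernel_size=11,
    strides=4,
    padding='same',
    activation=tf.nn.relu,
    data_format='channels_first')

pool1 = tf.layers.max_pooling2d(
    conv1, pool_size=3, strides=2, data_format='channels_first')
```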

This should bring AlexNet down to single digits (6-7 ms or so) per batch at batch size 128 on a GTX 1080, even on TF 1.1 (and likely TF 1.0 as well). Exact numbers aside, it should be far better than 25 ms, which is awful. For multi-GPU the results are also dramatic. I do not recall the exact number on K80s without datasets (I was less interested in that configuration), but I think it was in the 70 ms range; with my GPUs peered it is hard to compare directly, and your version of TF is a little older. With TF 1.2 and DataSets it was ~36 ms, though the AWS instance is not exactly the same. Either way it should be better.

FYI: if you are still running this when TF 1.2+ is out and you use the data_sets=True flag, you will need to download or point to the Python version of the CIFAR data set. To be clear, since my earlier comment seems to have been misunderstood: I am not, and was not, asking you to run TF 1.2 (it is not even released; it is in RC). I was just sharing the numbers.
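For anyone wiring up the data_sets=True path, the input pipeline on TF 1.2 is roughly the following (a hedged sketch using the TF 1.2-era tf.contrib.data API, which later moved to tf.data; random arrays stand in for the real CIFAR pickles):

```python
import numpy as np
import tensorflow as tf  # TF 1.2-era API; tf.contrib.data became tf.data in 1.4

# Random stand-ins for the Python CIFAR pickles, CIFAR-shaped and NCHW.
images_np = np.random.rand(50000, 3, 32, 32).astype(np.float32)
labels_np = np.random.randint(0, 10, size=50000).astype(np.int32)

dataset = tf.contrib.data.Dataset.from_tensor_slices((images_np, labels_np))
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(128)
dataset = dataset.repeat()  # loop indefinitely for benchmarking

iterator = dataset.make_one_shot_iterator()
next_images, next_labels = iterator.get_next()  # feed these into the model
```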

I also deleted a bunch of dead code and commented-out code; it felt sloppy and made the script hard to follow. I am not very good at TensorFlow, so let me know if I missed something. I checked the losses and I do not think I changed anything that is unfair. It looks like you had distortions turned off on the other platforms as well as on TensorFlow, so I just made the path more direct to avoid any needless transforms.

No more changes on Alexnet from me. This 100% matches the Performance Guide.
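For reference, the "split variables across GPUs" part of the title can be sketched as a device function that round-robins variables over the GPUs while compute stays on each tower's GPU (a hypothetical sketch; the class name is mine, not from the PR):

```python
import tensorflow as tf

class RoundRobinVariableSetter(object):
    """Places variables round-robin across GPUs; other ops stay on the worker."""

    def __init__(self, worker_device, var_devices):
        self._worker_device = worker_device
        self._var_devices = var_devices
        self._next = 0

    def __call__(self, op):
        if op.type in ('Variable', 'VariableV2'):
            device = self._var_devices[self._next % len(self._var_devices)]
            self._next += 1
            return device
        return self._worker_device

# Build the GPU 0 tower with its variables spread across both GPUs.
setter = RoundRobinVariableSetter('/gpu:0', ['/gpu:0', '/gpu:1'])
with tf.device(setter):
    w = tf.get_variable('w', [1024, 1024])  # lands on /gpu:0
    b = tf.get_variable('b', [1024])        # lands on /gpu:1
```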

tfboyd commented 7 years ago

This also improves CPU performance. I was not able to get the exact same CPU, but A/B testing of the old script vs. the new script showed a substantial improvement.
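For anyone reproducing the CPU A/B runs, TensorFlow's thread pools can be pinned explicitly so the comparison is apples-to-apples (a small sketch; the counts are examples):

```python
import tensorflow as tf

# Pin both thread pools so old-vs-new script runs use identical parallelism.
config = tf.ConfigProto(
    intra_op_parallelism_threads=16,  # threads used inside a single op
    inter_op_parallelism_threads=16)  # ops allowed to run concurrently

sess = tf.Session(config=config)
```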

tfboyd commented 7 years ago

I also have ResNet finished and am adding the final touches; that will be a different pull request. Just like Torch, we also have improvements to the LSTM so it calls cuDNN.
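Roughly, calling into cuDNN for the LSTM looks like this (a hedged sketch using the layer-style tf.contrib.cudnn_rnn API from later TF 1.x releases; earlier releases used a lower-level interface, so treat this as illustrative rather than the actual change):

```python
import tensorflow as tf

# Time-major input: [time, batch, features].
inputs = tf.placeholder(tf.float32, [20, 32, 256])

# CudnnLSTM runs the whole stacked LSTM as one fused cuDNN kernel on GPU,
# instead of stepping a Python-level RNN cell through time.
lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=2, num_units=256)
outputs, _ = lstm(inputs)  # outputs: [time, batch, num_units]
```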

tfboyd commented 7 years ago

In A/B testing against the previous code, it seems much faster on 1 thread and maybe the same on 16 threads, but my CPU is different. I tried to find something close on AWS. Nonetheless, even if the CPUs are different, it was faster in the A/B comparison.

Edited: it is hard to tell on different hardware.

shyhuai commented 7 years ago

Thanks. It achieves a noticeably faster speed. Because the performance guides of the different platforms are updated so frequently, and some platforms don't have such documents at all, it is hard for us to write benchmark scripts that achieve the best performance on every platform. Your contributions are very helpful to this project, and we hope the authors or experts of other platforms will provide better scripts as well. In addition, some platforms now support cuDNN in their LSTM implementations in newer versions, and we are revising the scripts to make that comparison fair.

tfboyd commented 7 years ago

@shyhuai I understand. I do this kind of testing all the time, and it is really hard to keep the models matched as they get more complicated. NVIDIA has multiple teams we work with to ensure that when they test, the models are as close to identical as possible, and there are still small mistakes.

Keep up the good work, have fun and thank you for looking over my PRs,

Toby

p.s. If any of my changes ever look incorrect, let me know. I do not, under any circumstances, want to break a model or have it deviate from what everyone else is doing. Winning or losing a benchmark with the wrong code doesn't help anyone. We are working to figure out why we are still further behind than I would like on the ResNet example on K80s, but the number you should be getting now is good, and it is what it is.