soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets
MIT License

[August 2015] Rejigging the marks... #46

Closed · soumith closed this 8 years ago

soumith commented 8 years ago

With cuDNN R3 coming in, improvements to Nervana, a new kid on the block called Chainer, and faster Facebook kernels, I will be doing a minor re-run of the benchmarks to see how things have improved.

Target date: August 15th.

I am still thinking quite a lot about how to take the benchmarks forward: beyond ConvNets, beyond images (into NLP, video, and audio), and beyond single-GPU. If any domain experts have suggestions (especially for audio and NLP), please do write to me.

The only thing that stopped me from multi-GPU benchmarks was the lack of enough frameworks to benchmark. That seems to have changed, and a decent number of frameworks now support multi-GPU, so I will plan on that.
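For those curious about the methodology: roughly, each layer is timed over a batch, with a few warm-up iterations before the timed runs. Below is a minimal sketch of that timing loop in plain Python/NumPy; the layer sizes are hypothetical stand-ins (an AlexNet-like first layer expressed as an im2col GEMM), and the real scripts call each framework's native GPU kernels instead.

```python
# Minimal sketch of the timing methodology (hypothetical sizes; the real
# benchmarks time each framework's native GPU conv kernels, not NumPy).
import time
import numpy as np

def bench(fn, warmup=3, iters=10):
    """Average wall-clock milliseconds per call, after warm-up runs."""
    for _ in range(warmup):   # warm up caches / autotuners / lazy init
        fn()
    start = time.time()
    for _ in range(iters):
        fn()
    return (time.time() - start) / iters * 1000.0

# Stand-in workload: the im2col GEMM of an AlexNet-like first layer
# (96 filters of 3x11x11 over a 128-image batch, 55x55 output positions).
filters = np.random.randn(96, 3 * 11 * 11).astype(np.float32)
patches = np.random.randn(3 * 11 * 11, 55 * 55 * 128).astype(np.float32)

print('%.2f ms/iter' % bench(lambda: filters @ patches))
```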

More fun to come soon.

Checklist:

scott-gray commented 8 years ago

Any plans to incorporate batchnorm in this run? What about other potential bottlenecks in the network, like data loading and augmentation? Will fp16 testing now become standard, since (as far as I know) it is now available in cuDNN?

I think we're working on some standard RNN/LSTM network benchmarks, so that might be a place to start for other network types.

naibaf7 commented 8 years ago

@soumith How could we include Caffe #2610 (https://github.com/BVLC/caffe/pull/2610) for testing?

hughperkins commented 8 years ago

@naibaf7

Technically, objectively:

Subjectively, my opinion:

hughperkins commented 8 years ago

For cltorch: I just noticed that the test/test-perf.lua script in the clnn repo was missing the layer 2 single-layer timings, so I've pushed an update to the clnn repo just now.

If you reinstall clnn, it should pull in these changes, if needed.

soumith commented 8 years ago

@hughperkins Thanks, will do.

naibaf7 commented 8 years ago

@soumith When can we expect results of the benchmarks? :)

soumith commented 8 years ago

@naibaf7 I am wrapping them up. I have finished benchmarking everything except fb-cunn, cuDNN in FP16 mode, and Chainer. Hopefully by Monday I will write a detailed comment with my findings.

naibaf7 commented 8 years ago

@soumith Do you also test on the CPU, or is only the Titan X evaluated? Caffe can also be run in CPU mode, and Greentea supports CPUs very well in OpenCL mode.

soumith commented 8 years ago

@naibaf7 At the moment I am only doing the GPU side of things. For Caffe, Torch, and Greentea I suppose I can also run CPU benchmarks without too much effort on my side. I did not think people were interested in those.

hughperkins commented 8 years ago

I guess the CPU would mostly be used only during development, whereas actual training will tend to be GPU-based?

naibaf7 commented 8 years ago

@hughperkins For the moment, it seems that way. But with upcoming asynchronous solvers and MPI support, as well as well-parallelized backends (the Caffe CPU backend is single-threaded except for BLAS calls, while the Greentea OpenCL backend uses parallelized kernels and a parallel CPU BLAS), it might become reasonably interesting again to use existing CPU clusters for training.

A second perspective is APU/HSA devices. On an i7-4790K, for example, take the Caffe CPU backend as the 1x baseline on AlexNet: Greentea on the same CPU reaches almost 2x, and the integrated graphics alone evaluates at 1.5x to 2x. Splitting AlexNet across the integrated graphics and the CPU reaches 4x the speed of the old Caffe CPU backend on the same device, roughly the sum of the two units' individual throughputs. This already approaches the training speed of mid-range GPUs.

Just something to keep in mind, seeing that the future exascale devices we will have to work with look much like this: http://www.hpcwire.com/2015/07/29/amds-exascale-strategy-hinges-on-heterogeneity/

hughperkins commented 8 years ago

@Fabian Re: APU/HSA devices: interesting, I will reply in your PR, to keep this thread clean(er).

hughperkins commented 8 years ago

@soumith By the way, a quick heads-up before you inadvertently step into a minefield :-P There are actually multiple Caffe OpenCL forks, none of which has been officially recognized or endorsed, and I am not aware of any plans to merge any of the forks into mainline Caffe in the near term.

Therefore, I strongly recommend choosing a name for Fabian's fork that does not imply it is the one and only Caffe OpenCL fork. I think 'greentea' or 'caffe greentea' meets this requirement, and will plausibly provide you a mine-free life :-)

soumith commented 8 years ago

@hughperkins Thanks for the heads-up :) Everything is done except Chainer. That should be finished tomorrow as well, along with the write-up.

Will do the CPU candidates later, but there seem to be a few. I have to collect them, read a bit about each, and get to properly benchmarking each of them. CPU benchmarking gets much more complicated in general.

bhack commented 8 years ago

/cc @gujunli

naibaf7 commented 8 years ago

@soumith Oh, you plan on doing CPU? Cool :) Remember to use a good BLAS (OpenBLAS compiled from source, or MKL) on your CPU with Greentea and Caffe, and remember to configure it in Makefile.config. Additionally, on Greentea the CPU must be found with device_query and selected with the -gpu=x flag, where x is the ID of the CPU.

Thanks :)
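For illustration, selecting the CPU device through pycaffe could look like the sketch below; the device ID and model filename are placeholders (check what device_query reports on your machine), and flag details can differ between branches.

```python
# Hypothetical sketch: running a net on the OpenCL CPU device via pycaffe.
# The device ID (1) and the model file are placeholders; `caffe device_query`
# lists the real device IDs on a given machine.
import caffe

caffe.set_mode_gpu()  # on Greentea, OpenCL devices (including CPUs) are
                      # driven through GPU mode
caffe.set_device(1)   # assumption: ID 1 is the OpenCL CPU device
net = caffe.Net('alexnet_deploy.prototxt', caffe.TEST)  # placeholder model
net.forward()         # forward pass on the selected device
```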

hughperkins commented 8 years ago

@soumith Note that the third Caffe OpenCL fork is public now :-P https://github.com/gujunli/OpenCL-caffe-upstream-test.

bhack commented 8 years ago

@soumith I think that @michaellarabel of phoronix.com has a lot of interesting hardware to run your benchmarks on. You could talk with him.

naibaf7 commented 8 years ago

@bhack @hughperkins @soumith We compared the OpenCL performance. On NVIDIA hardware, AlexNet runs at approximately the same speed with the https://github.com/BVLC/caffe/pull/2610 and https://github.com/BVLC/caffe/pull/2195 PRs of Caffe.

So what you will see in your benchmarks applies equally to both at the moment. On AMD hardware it would be a bit different right now.

hughperkins commented 8 years ago

Note: apparently the link to the AMD repo above was just a test link; it's gone now.

bhack commented 8 years ago

@hughperkins Really strange timing, considering that its commit history was more than a month old and that the license in the source files had already been changed to an AMD copyright.

naibaf7 commented 8 years ago

@bhack @hughperkins Please, no speculation for now; it just adds to the confusion, I think :) We're planning to discuss how to move forward with OpenCL soon, as it's obviously not good for the cause to have so many branches.

@bhack You have kind of advertised my branch/PR everywhere.

bhack commented 8 years ago

@naibaf7 Yes, AMD people operating under cover on GitHub doesn't really help clarify the situation. I'm confident that you can bring the discussion back into a public space as soon as you can.

hughperkins commented 8 years ago

Concur with bhack's view.

naibaf7 commented 8 years ago

@bhack @hughperkins

Yes, I know, but as a clarification (I think AMD is okay with me sharing this): one of the branches, from Junli Gu, was actually an internal AMD research branch. Robert started his OpenCL branch unofficially in his spare time. Shortly after that, I started my OpenCL branch as part of my thesis, with AMD sponsoring.

As of now, no OpenCL branch has full official support from AMD. Please note that all the branches have their pros and cons; this needs to be resolved, along with a plan for how to proceed (who does what, whether to merge or keep feature branches, device abstraction, speeding up the convolutions...).

#2195 and #2610 have also been concurrent projects until now, which powered some advancements but also led to heated discussions. This will be resolved now, and we will plan collaboration.

However, it is a bad idea to hold all the discussions in public first (it creates confusion), especially because only a handful of developers are involved.

hughperkins commented 8 years ago

One of the branches, from Junli Gu, was actually an internal AMD research branch. Robert started his OpenCL branch unofficially in his spare time. Shortly after that, I started my OpenCL branch as part of my thesis, with AMD sponsoring.

It seems a bit sub-optimal to me for a company that gives the impression of not being overly endowed with cash to have three people working independently on the exact same problem, while the AMD compiler is still buggy and its optimization plausibly patchy. Even if you argue 'well, two of them were working for free', the opportunity cost is still massive. Theano is still CUDA-only, Chainer too, and so on.

naibaf7 commented 8 years ago

@hughperkins That's why I would prefer not to see speculation on here. I just told you what I know to clear up the situation; now I'll discuss with AMD how to proceed, so let's see about that first before jumping to conclusions.

Besides, the branches also have specific advantages and domain-specific optimizations, so there's a lot to profit from when going forward with OpenCL.

soumith commented 8 years ago

@hughperkins When a company is large and distributed, parallel efforts can also happen out of interest. Let's not waste the discussion on this speculation. FYI, there's also a C++ AMP implementation of the Torch backends for AMD (funded by AMD) here: https://github.com/NEELMCW/MCW_CPPAMP_TORCH, but it's not as clean and nice as your stuff.

hughperkins commented 8 years ago

@soumith Basically, I don't want to see AMD go the way of Sun, since AMD is pretty much the main competitor to NVIDIA right now (Intel has some offerings too, but mostly integrated GPUs for now, AFAIK). Sun, in my opinion, invested tons of money in open source that didn't seem to generate any return. I reckon AMD should either focus on AMD-specific stuff, like the compilers and so on, or else make proprietary, non-free libraries that generate revenue; I don't see any reason why they can't. Intel does this with MKL, and MKL seems to be doing OK for itself. I know this might seem odd coming from someone who writes a lot of open source, but I'd rather have an AMD that produces non-free stuff and survives than one that produces lots of free stuff and gets eaten :-(

naibaf7 commented 8 years ago

@soumith I've seen that you changed the ViennaCL installation script for Greentea. Are you on a distribution that does not ship ViennaCL via APT/DNF/YUM?

soumith commented 8 years ago

@naibaf7 I am on Ubuntu 14.04, but the makefile wasn't picking it up, so I just did things manually; I did not look into it too much.

naibaf7 commented 8 years ago

OK, that's interesting. Here, and on Amazon servers, it worked with Ubuntu 14.04... well, anyway, cool that you could make it work for you. The more advanced method is to switch to clBLAS. But I think being able to use both BLAS libraries is a pretty cool feature, for easy installation and flexibility.

bhack commented 8 years ago

@michaellarabel has an AMD R9 Fury close at hand. @naibaf7 It would be great to benchmark on that.

gujunli commented 8 years ago

I have an R9 Fury, if anyone wants to test performance on it. I am not sure how easy the access is, but if I can access your code, that will be easy. @naibaf7 @hughperkins

gujunli commented 8 years ago

@naibaf7 I like your earlier comment about working on HSA/APUs. We have done some work on Caffe on APUs. I would like to discuss it with you.

gujunli commented 8 years ago

@soumith We have some evaluation results of OpenCL Caffe (a research lab's internal version) on Fury and W9100. I wonder whether it helps for me to share the results with you but now the code for now. We evaluated against a Titan X and a GTX 980. It might be nice to know where we are now on your performance list.

bhack commented 8 years ago

@gujunli is "but now the code for now" interpreted as "but not the code for now"?

bhack commented 8 years ago

@soumith Is the Intel framework at https://github.com/01org/idlf testable?

hughperkins commented 8 years ago

re: "can I submit results without sourcecode, using a non-publically available library?" What benefit do you see in doing this?

bhack commented 8 years ago

@hughperkins It was my fault for having pointed everyone to an internal repository of AMD's Chinese research center that was maintained as a public GitHub repository.

hughperkins commented 8 years ago

@bhack No, I had put Junli on my 'follow' list long ago, so I saw it anyway :-)

bhack commented 8 years ago

@hughperkins It was not AMD's Chinese center but the US research center in Illinois.

hughperkins commented 8 years ago

@bhack You know that AMD might have more than one office, right? :-P

bhack commented 8 years ago

@hughperkins Yes, my bad. Sunnyvale, California.

hughperkins commented 8 years ago

@bhack Hmmm, what I thought I was communicating is not exactly what I communicated. Anyway... I wouldn't read too much into geographic locations; it is a global corporation.

bhack commented 8 years ago

@hughperkins Yes, but my 'location' was inferred only from the declaration 'I am leading AMD research's DNN project'. So probably she is directing this effort with resources distributed around the globe, and there is really no physical DNN group co-located in the Sunnyvale facilities.

naibaf7 commented 8 years ago

@bhack @hughperkins Okay... you are very off topic now, just saying :D Not sure if @soumith is very happy with that.

To stop the speculation once again: currently the DNN people are physically in the same location, so don't worry about coordination; it will be great. AMD is pulling together all the important people to get this done the right way.

I also found AMD more reachable and easier to contact than NVIDIA and Intel; it is fun to work with them. I once won an Intel ISEF award, and even then it was crazy difficult to get in contact with anyone at Intel other than public relations people.

bhack commented 8 years ago

@naibaf7 Yes, we are surely off topic, it's true. I'm really happy that AMD is starting to have some direction and coordination now, but you can agree with me that the approach so far was quite confusing and scattered. Removing repositories shortly after they are posted about, and without a comment, is generally not good marketing for a big company like AMD. But never mind; it was only a start, with some small false steps.

hughperkins commented 8 years ago

Currently the DNN people are physically in the same location, so don't worry about coordination

I'm not sure that's exactly true, unless your definition of 'DNN people' is different from my own interpretation of it.

I also found AMD more reachable and easier to contact than NVIDIA and Intel

I never tried to contact NVIDIA. Or Intel, actually... I reckon the CUDA projects have a fair amount of contact with NVIDIA, though. Caffe has sponsorship from NVIDIA, or at least the provision of one or more NVIDIA GPUs, right on their front page.

bhack commented 8 years ago

@hughperkins It really is off topic. The only important thing here is to understand which version of OpenCL Caffe will be benchmarked. I don't know if it is useful to have benchmark results from private versions; only @soumith can tell us.