Extended goals - Githubissues

bhack commented 8 years ago

This repository started targeting convolution and its proximity context. I'm guessing if we could extend this goal to be a multi stakeholder extended API for dnn. This require the introduction of other common accellerated kernels/ops in the API (probably starting with porting those that in Opencl Caffe branch). I've opened this to collect feedbacks on the possible roadmap cause I think that the Opencl dnn scenario was historically too fragmented and the standard the facto accellaration API design was roadmapped by cudnn. /cc @edgarriba @hughperkins @gongzg (Intel). I don't /cc amd cause I think its Opencl caffe fork has no more resources allocated and it seems more involved in the hccaffe effort.

edgarriba commented 8 years ago

@bhack thx to take this initiative! I'm in.

The chosen approach for generating the kernels source code makes the library quite singular and attractive to explore more the capabilities by adding additional kernels.

bhack commented 8 years ago

/cc @mtamburrano (as Opencv GSoC Mentor)

naibaf7 commented 8 years ago

@bhack Good idea :) I'll be back on improving libDNN after the 14th of July, and will consider which kernels to generalize into libDNN from Caffe.

bhack commented 8 years ago

@hughperkins What are you experimenting in https://github.com/hughperkins/gpu-experiments?

CNugteren commented 8 years ago

I am also willing to contribute on the kernel development and tuning. In particular if there is something that would be fit as an extension to my BLAS library such as a batched GEMM or some other special version of matrix-multiplication.

bhack commented 8 years ago

@CNugteren You are welcome.

gongzg commented 8 years ago

@bhack @naibaf7 libdnn is a good idea. I will be happy to contribute.

bhack commented 8 years ago

@gongzg Nice to have you on board! @keryell do you have somebody in Xilinx that could be interested in this?

edgarriba commented 8 years ago

welcome everybody!

after this first setup of the group I would like to extend you some ideas and questions that came to my mind during the design of a proper architecture for tiny-cnn (sorry for the advertisement :smile:) where I want to plug the LibDNN concept and which I think that will be a good starting point to shape this project.

Should libdnn handle all the OpenCL, CUDA backend stuff or it's the consumer framework who will do it?
How many kernels and for which platforms, devices we're going to provide support, native CPU, OpenCL CPU/GPU/FPGA/etc, CUDA, CUDNN ?
How the kernels will be attached, as it is right now or in physical independent sources?

As said this are only thoughts. Feel free to comment them, add more or whatever!

naibaf7 commented 8 years ago

@edgarriba My thoughts:

Optimal if the framework handles device initialization & memory, but I will provide a simple interface which will initialize the chosen devices automatically. There is no need for new code on this, I will copy the existing initialization from OpenCL Caffe.
Additionally, I can provide functions to handle device memory (allocation & transfer), but the more optimal path would be to do it in the framework (can reuse memory & be more efficient).
Initially, the convolution kernel builder will support AMD & nVidia chips, both OpenCL and CUDA. In a second development phase (starting from 15th of July or whenever the SK GT laptop arrives), I will add support for ARM Mali and Intel Skylake, probably with the help from @gongzg.
I think optimally the kernels can all be generated in the same way that it is now regarding convolutions. The advantage here is that the kernel structure can be modularized, and the autotuner can pick & match parts for optimal performance. However if a kernel is vastly different (FFT, non-GEMM methods?) it will be a separate kernel generator.

bhack commented 8 years ago

I'm not convinced with the double approach of the first bullet point. If we want that "clients" have the responsibility of memory and device handling I don't think that the simplified interface makes sense in libdnn. I think that maintaining both approach in the API can confuse users.

naibaf7 commented 8 years ago

@bhack I think not, and here is how I plan to avoid confusion: The convolution library interface will stay the same. I will just allow the device-class constructor to be used without previously initialized OpenCL+ViennaCL respectively CUDA context. Furthermore, I will add a method to allocate and copy memory in the device class. That's it, and is not adding confusion I believe.

bhack commented 8 years ago

So do you mean "Optimal if the framework handles device initialization & memory" in term of this simplified interface call? Probably could we still select an abstracted device list to use client side?

naibaf7 commented 8 years ago

@bhack Well, in case the framework already uses CUDA or OpenCL devices for other kernels/methods/libraries like cuDNN, cuBLAS, clBLAS, CLBlast, it does make much more sense to just use the CUDA or ViennaCL context from an EXISTING context and create a LibDNN-device without initializing a device again... Just if the framework ONLY needs OpenCL/CUDA for libDNN, then the "simplified" interface will be useful, because it will handle everything.

Thus also the reason why both options are needed.

bhack commented 8 years ago

Probably could be a solution to support both use cases. But we need to think how to retrieve the list of devices to let clients select what kind of devices wants to use with the simplified interface.

naibaf7 commented 8 years ago

@bhack Yeah, such a function also already exists in OpenCL Caffe, I'll just offer it in libDNN as well.

bhack commented 8 years ago

/cc @culurciello

keryell commented 8 years ago

@bhack Yes of course we have many people at Xilinx interested in DNN libraries in general and this one too...

bhack commented 8 years ago

@keryell If you want add someone from Xilinx to this thread you are welcome.

bhack commented 8 years ago

/cc @benoitsteiner if you want to share us some TF design needs.

hughperkins commented 8 years ago

@bhack re: gpu-experiments. I am learning :-) Better late than never I reckon :-D

bhack commented 8 years ago

@hughperkins Yes surely late but also AMD (imho one of the big CL player with Intel) quite changed its strategy on Deep Learning. See a little bit more near to your work

bhack commented 8 years ago

Has anyone benchmarked kernels in HIP C++ language?

bhack commented 8 years ago

/cc @mangupta

hughperkins commented 8 years ago

@bhack So, one of the things I've been learning, is that generally speaking, GPUs tend to obey the laws of physics, That is, it doesnt matter which language you write things in, the execution time is governed by things like how many warps each sm can hold simultaneously, latency on instructions, and latency on memory fetches, for example. Things tend to run the same speed in all languages, on the whole.

What do you see as advantages to hcc, in terms of performance?

bhack commented 8 years ago

It not was simply in term of performances cause I've not tested it. But there are some interesting things in the FAQ

hughperkins commented 8 years ago

Yes. So, basically, as far as I know, if you came across a brand-new library, written in CUDA, hcc should make it dead-easy to port to hcc, since hcc is kind of ... similar :-P ... to CUDA. However, it wont magically make things faster or slower, and, specifically, hcc does not in itself contain an implementatoin of cudnn as far as I know?

bhack commented 8 years ago

It is not only a porting tool but it was not a proposal. So on: Amd is pushing on HCC on caffe and torch forks with third party help. Nvidia it is stalled on Opencl 1.2 but it is at the head of Opencl/Kronos and cudnn it is still the standard de facto acceleration API. If we could create this alternative with Opencl 1.2 it is OK but I've seen that also porting Neon kernels has created some issue using only Opencl 1.2 features. Or not?

hughperkins commented 8 years ago

I've seen that also porting Neon kernels has created some issue using only Opencl 1.2 features. Or not?

I would say: Not :-). Basically:

on NVIDIA, we can use inline assembler, so it's easy to use ballot, popcnt etc, zero effort
on all other devices we can use whatever is available on those devices, eg OpenCL 2.0 sub-groups and so on. Currently, ballot is not currently supported in OpenCL 2.0 though. Looks like AMD GCN ISA does not actually support it? http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf

It's pretty easy to mix and match different OpenCL library versions according to the current hardware targeted. I long ago (well, ok, in last couple of months...) gave up on the idealistic idea that one single kernel could run unmodified everywhere. Think of it more like writing a webpage that will work across IE, Firefox, etc ...

bhack commented 8 years ago

@naibaf7 Can you recap a possible roadmap where you'll we be operative again?

naibaf7 commented 8 years ago

@bhack I am back at work. Currently I have to reinstall my platforms (GTX 1080 just arrived, as well as a platform for Intel Iris (HD 540)). I will post a roadmap shortly :)

bhack commented 8 years ago

@naibaf7 Do you will have also an Intel Knights Landing?

naibaf7 commented 8 years ago

@bhack Knights Landing seems to be more in the MKL/CPU space and does not do well (also has limited support) in regards to OpenCL. So no, it is not planned to optimize or work on Knights Landing.

I have enough work upcoming with FP16, Mali/ARM, Intel SKGT 540, Pascal/Hawaii/Polaris as it is :)

bhack commented 8 years ago

Probably in Q3 MKL-DNN will be open sourced... we will see.

hughperkins commented 8 years ago

Probably in Q3 MKL-DNN will be open sourced...

Wow. Thats ... surprising. I thought MKL was one of Intels' major pieces of IP? I remember that one of the advantages of using matlab was that it had MKL as part of it.

bhack commented 8 years ago

But I think that with mkl dnn they will open source only the dnn layers.

hughperkins commented 8 years ago

Ah, MKL-DNN != MKL. I see.

bhack commented 8 years ago

Every player has its own closed source near metal plan. So we are quite "alone" if we want to go ahead with this :unamused:

hughperkins commented 8 years ago

Every player has its own closed source near metal plan.

I think that's what they need really... you can get quite far in OpenCL. But, have you seen the level of detail that @scott-gray went into for his winograd kernels? this describes his gemm implementation, not his winograd implementation, but I think it's fair to say it's likely representative of his winograd implementation. It looks like registers themselves are arranged in banks, and have bank conflicts, and he spends considerable time using registers in a specific order, in order to avoid such conflicts. https://github.com/NervanaSystems/maxas/wiki/SGEMM#calculating-c-register-banks-and-reuse Such optimizations are entirely out of reach of OpenCL, at least ,without an optimizing compiler capable of computing them automatically, which is apparently pretty hard to do, otherwise NVIDIA would already have done it.

bhack commented 8 years ago

And this are some details of his Winograd implementation. Probably we will need machine learning in compilers :grin:

hughperkins commented 8 years ago

And this are some details of his Winograd implementation.

Ah, thats new information. Thanks! :-)

Probably we will need machine learning in compilers :grin:

Yes, maybe :-) Even if there are complicated underlying mechanisms for certain behaviors, it seems plausible that with sufficient data, a network, or similar, might be able to empirically learn reasonable guidelines for predicting the effect of various layouts, sequencing, and so on.

bhack commented 8 years ago

@hughperkins I think that @dividiti probably has something to tell us on this topic. See also http://arxiv.org/abs/1511.02490

bhack commented 8 years ago

/cc @gfursin

hughperkins commented 8 years ago

http://arxiv.org/abs/1511.02490

Interesting. Although their baseline seems pretty naive? ie, their baseline is fixed-size workgroups. I would expect them to use eg Cedric's autotuner as a baseline/comparison for example? Although they do reference clTune, they dont seem to provide any benchmarks relative to it?

bhack commented 8 years ago

This is part of a big movement. See http://ctuning.org/cm/wiki/index.php?title=Reproducibility

bhack commented 8 years ago

/cc @gstoner

edgarriba commented 8 years ago

@naibaf7 As mentioned in https://github.com/BVLC/caffe/pull/4421#issuecomment-235021339 by @CNugteren maybe it's a good idea to rely on https://github.com/CNugteren/CLCudaAPI and focus libdnn as a pure kernels library in order to take advantage of their advanced work.

Besides, I'd like to propose to embed libdnn in a header to make it more portable. What do you think?

naibaf7 commented 8 years ago

@edgarriba I do not want to port it over to CLCudaAPI at this time, especially since I'm not certain yet in which direction OpenCL Caffe will go at this time, but also because the autotuning, kernel launching and device capability checks are very tightly coupled in there.

As for the header-only: Yes that's an option. What are possible advantages and disadvantages here? ViennaCL and Boost are built as header-only and work well...

edgarriba commented 8 years ago

@naibaf7 You know more the low levels details but it's a pity because there's a lot of useful stuff there that probably will be reimplemented (or at least I was thinking to do it). Hope we can balance it with Caffe implementation.

Regarding header-only, for me just having something to directly include with a header it's like a hand in a glove. However, all that glitters is not gold. Check this discussion: http://programmers.stackexchange.com/questions/305618/are-header-only-libraries-more-efficient

bhack commented 8 years ago

@naibaf7 We need to know how to go ahead time it is precious on GSoC if we consider libdnn upstream.. Why we cannot start introducing this here? https://github.com/BVLC/caffe/pull/4421#issuecomment-235064468 here

naibaf7 / libdnn

Extended goals #2