romix opened this issue 6 years ago
@romix
Thanks a lot for such a quick response!
Yes, it's planned as soon as I'm done with my work on the Caffe branch, which includes finishing Quantization for LibDNN.
Nice! Any hints on when that is likely to happen, e.g. this summer, or later this year?
There are a few issues with dependencies now, since LibDNN now depends on the device abstraction from Caffe. I'm not sure how I will handle this in the standalone version yet.
Maybe you should introduce a higher-level device abstraction (an abstract base class?) that does not depend on anything directly. To integrate libdnn with a specific platform/package, one would then implement concrete code that maps the libdnn abstraction onto that platform's/package's abstractions.
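Just to illustrate what I have in mind, purely as a sketch (none of these class or method names exist in libdnn or Caffe today):

```cpp
#include <cstddef>
#include <string>

namespace libdnn {

// Framework-neutral device interface that libdnn itself would code against.
class Device {
 public:
  virtual ~Device() = default;

  // Device-side memory management.
  virtual void* Allocate(std::size_t bytes) = 0;
  virtual void Free(void* ptr) = 0;

  // Kernel compilation and launch, expressed in libdnn's own terms.
  virtual void* CompileKernel(const std::string& source,
                              const std::string& kernel_name) = 0;
  virtual void LaunchKernel(void* kernel, const void* const* args,
                            const std::size_t* global_work_size,
                            const std::size_t* local_work_size) = 0;
  virtual void Synchronize() = 0;
};

}  // namespace libdnn
```

An integrator (Caffe, a standalone app, ...) would then supply a thin adapter, e.g. an OpenCL- or CUDA-backed subclass, that maps these calls onto its own device handling, and the standalone libdnn itself would stay free of framework dependencies.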
BLAS support is not comparable to those libraries, since it only supports the BLAS functions required in the deep learning framework. It may be enough for your application, or not; YMMV. Also, FP16 and quantization support is present in LibDNN's own BLAS, which is not available in clBLAS or CLBlast. So it's different and not a drop-in replacement.
Thanks for explaining the differences.
A description of the algorithms will be available in my next publication.
I look forward to reading your paper.
Convolution performance depends a lot on the shape of the kernels etc.; I currently see performance around 3-4 times slower than cuDNN 7.0, e.g. on AlexNet.
OK.
New interesting features or improvements in the making right now: Kernel caching in SQLite, quantization (INT8/INT16), FP16 support and an improved autotuner routine.
Sounds very interesting. IMHO, caching cries out to be an abstraction, just like the device. Depending on how libdnn is used, SQLite may or may not be a good fit.
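For example, something like this (made-up names, not actual libdnn API) would let the integrator pick the backend, with SQLite being just one implementation:

```cpp
#include <map>
#include <string>

namespace libdnn {

// Pluggable cache interface: key = serialized kernel configuration,
// value = tuned parameters or a compiled kernel binary.
class KernelCache {
 public:
  virtual ~KernelCache() = default;
  virtual bool Lookup(const std::string& key, std::string* value) = 0;
  virtual void Store(const std::string& key, const std::string& value) = 0;
};

// Trivial in-memory backend for embedded or test scenarios where pulling
// in SQLite is undesirable; a SQLite-based cache would implement the same
// interface on top of a database file.
class InMemoryKernelCache : public KernelCache {
 public:
  bool Lookup(const std::string& key, std::string* value) override {
    auto it = entries_.find(key);
    if (it == entries_.end()) return false;
    *value = it->second;
    return true;
  }
  void Store(const std::string& key, const std::string& value) override {
    entries_[key] = value;
  }

 private:
  std::map<std::string, std::string> entries_;
};

}  // namespace libdnn
```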
Auto-tuner improvements (and usage examples) would be very nice. I got the current version of the tuner to run, but I'm not sure how to persist the results or reuse the tuned parameters across the different convolution kernels generated for the same HW platform. Is it even possible to tune for multiple convolution kernels at once? It doesn't look like libdnn provides much machinery for this, does it?
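To make the persistence part of my question concrete, this is roughly the usage pattern I'd like to have (again just a sketch with made-up names, not existing libdnn machinery):

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Example tuned parameters (tile sizes, vector widths, ...); the fields are
// placeholders, not the real tuning space.
struct TunedParams {
  int tile_m = 16;
  int tile_n = 16;
  int vector_width = 4;
};

// Build a lookup key from the device identity and the convolution shape, so
// every kernel generated for the same device/shape reuses the same result.
std::string MakeKey(const std::string& device_name,
                    const std::vector<int>& in_shape,
                    const std::vector<int>& kernel_shape,
                    int stride, int pad) {
  std::ostringstream key;
  key << device_name;
  for (int d : in_shape) key << "_i" << d;
  for (int d : kernel_shape) key << "_k" << d;
  key << "_s" << stride << "_p" << pad;
  return key.str();
}

// The store could be SQLite, a flat file, or an in-memory map; a std::map
// stands in here to keep the example self-contained.
std::map<std::string, TunedParams> tuned_cache;

TunedParams GetOrTune(const std::string& key) {
  auto it = tuned_cache.find(key);
  if (it != tuned_cache.end()) return it->second;  // reuse a persisted result
  TunedParams params;  // placeholder: here the autotuner would run once
  tuned_cache[key] = params;  // persist for the next kernel / process
  return params;
}
```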
Yes, exactly, the autotuner has been more of a proof-of-concept thus far. I'm also working with lower-end devices now (Raspberry Pi VC4CL and Mali T740) to check how the autotuner can be made economical and reliable, and then, again, results will be stored in SQLite. I'm not sure in which cases SQLite would not be optimal: it has good support on all operating systems where you'd want to use LibDNN.
The time frame for pushing a standalone LibDNN update would be ~August. Contributions to LibDNN within Caffe (the interface can already be used from there if you are interested in developing apps with LibDNN support) are welcome, as are suggestions for improvement.
Your library is pretty cool, but it looks like it has not been updated for a long time.
At the same time, the version of libdnn in your Caffe fork seems to be better maintained and has even gained some new features, like BLAS routine generators, etc.
Could you provide some insight about your plans regarding the standalone libdnn or libdnn in general?
Specifically, it would be nice if you could answer some of the following questions:
Do you plan to update the standalone libdnn, e.g. from the version in your Caffe fork?
What is the status of the BLAS support in the Caffe version of libdnn? How does it compare to something like clBLAS, CLBlast, or the CUDA counterparts of those libraries?
Could you provide a brief description of the algorithms you use when producing optimized fused convolution (and other) kernels, and explain how/why they are better/faster than e.g. im2col-based approaches or other well-known convolution implementations, either in terms of performance or memory consumption? The documentation is pretty sparse at the moment. If the work is based on any specific papers or well-known approaches, it would be nice if you could provide references.
How does libdnn compare, in terms of convolution performance, to the current versions of cuDNN and other well-known implementations? In the past you reported it was very fast, often faster than competitors. Is that still the case, or have there been recent advances that made other implementations faster?
Do you plan to add any new interesting features or improvements? If so, could you describe them?
Thanks!