oneapi-src/oneDNN

Primitive cache design #609

Closed · pinzhenx closed this issue 4 years ago

pinzhenx commented 4 years ago
  1. Why is the cache designed for primitives, not pds?

My guess is that some pds are pretty lightweight and easy to create. However, I noticed that the conv pd creation time is non-negligible.

If I comment out the calls to dnnl_primitive_execute and compare the preparation time before entering primitive::execute, I find that the preparation is dominated by conv pd creation:

[Screenshot: VTune profile (2019-12-03) showing conv pd creation dominating the preparation time]
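For reference, here is a minimal sketch of this kind of measurement against the v1.x C++ API; the shapes and the choice of a 3x3 direct convolution are purely illustrative:

```cpp
#include <chrono>
#include <iostream>

#include "dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // format_tag::any lets the implementation pick its preferred layouts.
    memory::desc src_md({1, 64, 56, 56}, memory::data_type::f32,
            memory::format_tag::any);
    memory::desc wei_md({64, 64, 3, 3}, memory::data_type::f32,
            memory::format_tag::any);
    memory::desc dst_md({1, 64, 56, 56}, memory::data_type::f32,
            memory::format_tag::any);

    auto t0 = std::chrono::steady_clock::now();

    // pd creation: implementation dispatch and layout selection.
    convolution_forward::desc conv_d(prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_md, dst_md,
            /*strides=*/{1, 1}, /*padding_l=*/{1, 1}, /*padding_r=*/{1, 1});
    convolution_forward::primitive_desc conv_pd(conv_d, eng);

    auto t1 = std::chrono::steady_clock::now();

    // Primitive creation: this is where the jitting happens.
    convolution_forward conv(conv_pd);

    auto t2 = std::chrono::steady_clock::now();

    // The pd is also what we query for the layouts the implementation
    // chose (the "dummy pd" use case mentioned below).
    memory::desc chosen_src = conv_pd.src_desc();
    (void)chosen_src;

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "pd creation:        " << ms(t1 - t0).count() << " ms\n"
              << "primitive creation: " << ms(t2 - t1).count() << " ms\n";
}
```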

I feel like caching the primitive desc also makes sense, as primitives and pds are 1:1 in most cases; the exception is when we create a dummy pd just to query some info.

  2. (minor) Can we get the primitive cache key?

If we could get the keys for cached primitives, we would have the opportunity to cache the input/output buffers on the framework side and eliminate allocations. Without such a key, we have no way to hold on to these buffers and fetch them back later.

This FR is not that important, but there might still be a performance issue if we rely entirely on the DNNL primitive cache without caching any buffers ourselves. A sketch of the framework-side buffer cache such a key would enable is below.
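A minimal sketch, assuming a hypothetical `get_cache_key()` accessor (no such API exists in DNNL today; it is exactly what this FR asks for). The buffer cache itself is plain C++:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical accessor this FR asks for: the key DNNL caches the
// primitive under. It does NOT exist in the library today.
// std::string get_cache_key(const dnnl::primitive &p);

// Framework-side buffer cache: reuse the allocations tied to a
// primitive instead of re-allocating them on every framework call.
struct cached_buffers {
    std::vector<float> src, dst;
};

std::unordered_map<std::string, cached_buffers> buffer_cache;

cached_buffers &get_buffers(
        const std::string &key, std::size_t src_elems, std::size_t dst_elems) {
    cached_buffers &bufs = buffer_cache[key];
    bufs.src.resize(src_elems); // no-op on a hit with unchanged shapes
    bufs.dst.resize(dst_elems);
    return bufs;
}
```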

emfomenk commented 4 years ago

Hi @pinzhenx,

Thanks for the questions!

> Why is the cache designed for primitives, not pds?

There are a few reasons for that:

> I find that the preparation is dominated by conv pd creation:

I agree that we don't optimize pd creation much, so I am not surprised that, with the execution commented out, pd creation appeared near the top of the VTune report :) But I wonder whether this creation cost indeed affects any real-world workloads you know of.

> (minor) Can we get the primitive cache key?

Good question. Currently we have an internal comparison operator for pd that allows using it as a key in the primitive cache. In theory we could expose this comparison operator, and on the FWK side you would use the pd itself as a key. However, I think you want the key to be something more serializable, like std::string. Not sure how expensive that would be though (given that a pd contains a lot of information). Your ideas are welcome.
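For illustration, a sketch of what a framework-side serialized key could look like if built from the creation parameters the framework already knows. The key format here is made up, and it can only approximate what the internal pd comparison checks:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Serialize everything that went into pd creation into a flat string.
// This only approximates the library-internal pd comparison: it cannot
// see runtime dispatch decisions (ISA, attributes, chosen layouts).
std::string make_conv_key(const std::vector<long> &src_dims,
        const std::vector<long> &wei_dims, const std::vector<long> &strides,
        const std::vector<long> &padding, const std::string &dtype) {
    std::ostringstream key;
    key << "conv:" << dtype;
    auto append = [&key](const char *tag, const std::vector<long> &vals) {
        key << ':' << tag;
        for (long v : vals) key << '_' << v;
    };
    append("src", src_dims);
    append("wei", wei_dims);
    append("str", strides);
    append("pad", padding);
    return key.str(); // e.g. "conv:f32:src_1_64_56_56:wei_64_64_3_3:..."
}
```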


In general, I want to note that we are still exploring ways to implement the primitive cache. In particular, we are considering caching not the primitive itself, but only the JITted kernels. This would allow better memory reuse, and we would avoid caching simple implementations that don't use JIT (and hence have zero creation cost). It would also allow reusing the cache between engines (not very relevant to CPU though). However, there are downsides as well: a non-obvious lifetime that the library needs to control somehow, a more expensive implementation, and slightly bigger overhead on primitive creation (as less is cached).
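To make the lifetime question concrete, here is one common shape for such a kernel cache; this is purely a sketch of the design space, not oneDNN's actual internals:

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Sketch only -- not oneDNN's actual implementation.
struct jit_kernel {
    explicit jit_kernel(const std::string &params) : params_(params) {
        // ... generate machine code for these parameters ...
    }
    std::string params_;
};

class kernel_cache {
public:
    // Primitives share kernels via shared_ptr; the cache holds weak_ptr,
    // so a kernel is freed once no live primitive references it. The
    // alternative (cache owns shared_ptr) keeps kernels warm across
    // primitive destruction, but then the library must define eviction.
    std::shared_ptr<jit_kernel> get_or_create(const std::string &params) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(params);
        if (it != cache_.end())
            if (auto kernel = it->second.lock()) return kernel; // hit
        auto kernel = std::make_shared<jit_kernel>(params); // (re)jit
        cache_[params] = kernel;
        return kernel;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, std::weak_ptr<jit_kernel>> cache_;
};
```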

@densamoilov is working on all that, but he is currently out for a few weeks. Anyway, we will be glad to discuss thoughts and ideas with you :)

vpirogov commented 4 years ago

Thanks to @emfomenk's work, primitive descriptor creation is significantly less expensive in v1.2. There are still cases where it's noticeable though, and we are considering caching for this part as well.