merrymercy / tvm-mali

Optimizing Mobile Deep Learning on ARM GPU with TVM
http://tvmlang.org/2018/01/16/opt-mali-gpu.html

about winograd batched MM performance #7

Open janboeye opened 6 years ago

janboeye commented 6 years ago

Hi @merrymercy, I am working on Winograd for CUDA. I found that the batched MM in your Winograd implementation is slow on the NVIDIA architecture. I guess this is because, when C is large, it cannot fully exploit the GPU's parallelism.
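To be concrete, this is roughly the batched MM stage I mean, written with TVM's `te` API (the shapes and layout here are only assumptions for illustration, not your exact code):

```python
from tvm import te

# Assumed sizes: tile size after transform, input channels, output channels, tiles.
alpha, C, K, P = 6, 512, 64, 196

U = te.placeholder((alpha, alpha, K, C), name="U")  # transformed kernel
V = te.placeholder((alpha, alpha, C, P), name="V")  # transformed input tiles
c = te.reduce_axis((0, C), name="c")

# One (K x C) x (C x P) GEMM per (eps, nu) position. The batch axis is only
# alpha * alpha (36 here), and a large C only grows the reduction axis,
# which adds no extra parallelism on the GPU.
M = te.compute(
    (alpha, alpha, K, P),
    lambda eps, nu, k, p: te.sum(U[eps, nu, k, c] * V[eps, nu, c, p], axis=c),
    name="M",
)
```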

Do you have any idea about this part?

Thanks

merrymercy commented 6 years ago

I am also working on CUDA Winograd.

  1. The schedule for Mali GPU cannot be used for NVIDIA GPU. The main difference is the usage of shared memory. You need a totally different schedule for both the transformations and the batched MM. For batched GEMM, see https://github.com/dmlc/tvm/tree/master/topi/recipe/gemm for an example (there is also a sketch after this list).
  2. For NVIDIA GPU, if we want the best performance, we cannot re-layout the data several times like we do on Mali, because some stages become memory bound on NVIDIA GPUs (NVIDIA vs. Mali: peak FLOPS is about 50~200x higher, but memory bandwidth is only about 10x higher). According to the original paper, we should fuse the transforms and the batched GEMM into one kernel.
  3. Actually, I cannot figure out how to fuse them to get the best performance. The open-source code from that paper (the neon library) is in assembly and I cannot read it. For now I only have some preliminary results: for inference, if we do the kernel transformation in advance, our kernel can beat cuDNN's best Winograd when the kernel tensor is large (such as the last few layers of ResNet).
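To illustrate point 1, here is a minimal sketch of what a CUDA schedule for the batched MM could look like, following the shared-memory tiling pattern of the GEMM recipe linked above. All sizes, tile factors, and names are assumptions for illustration; this is not a tuned schedule.

```python
import tvm
from tvm import te

# Illustrative sizes only: batch = alpha*alpha tile positions; the reduction
# axis corresponds to the input-channel dimension C discussed above.
num_batch, num_out, num_tiles, num_in = 36, 64, 196, 512

A = te.placeholder((num_batch, num_out, num_in), name="A")    # transformed kernel
B = te.placeholder((num_batch, num_in, num_tiles), name="B")  # transformed input
k = te.reduce_axis((0, num_in), name="k")
C = te.compute(
    (num_batch, num_out, num_tiles),
    lambda b, y, x: te.sum(A[b, y, k] * B[b, k, x], axis=k),
    name="C",
)

s = te.create_schedule(C.op)
AA = s.cache_read(A, "shared", [C])   # stage operands through shared memory
BB = s.cache_read(B, "shared", [C])
CC = s.cache_write(C, "local")        # accumulate in registers

num_thread = 16
block_x = te.thread_axis("blockIdx.x")
block_y = te.thread_axis("blockIdx.y")
block_z = te.thread_axis("blockIdx.z")
thread_x = te.thread_axis((0, num_thread), "threadIdx.x")
thread_y = te.thread_axis((0, num_thread), "threadIdx.y")

b, y, x = s[C].op.axis
by, yi = s[C].split(y, factor=num_thread)
bx, xi = s[C].split(x, factor=num_thread)
s[C].reorder(b, by, bx, yi, xi)
s[C].bind(b, block_z)
s[C].bind(by, block_y)
s[C].bind(bx, block_x)
s[C].bind(yi, thread_y)
s[C].bind(xi, thread_x)

# Each thread accumulates one output element; A/B tiles are loaded into
# shared memory once per reduction chunk.
s[CC].compute_at(s[C], xi)
ko, ki = s[CC].split(s[CC].op.reduce_axis[0], factor=num_thread)
s[AA].compute_at(s[CC], ko)
s[BB].compute_at(s[CC], ko)

# Cooperative loading: all threads of the block fill the shared tiles.
for load in [AA, BB]:
    fused = s[load].fuse(*s[load].op.axis)
    ty, rest = s[load].split(fused, nparts=num_thread)
    tx, _ = s[load].split(rest, nparts=num_thread)
    s[load].bind(ty, thread_y)
    s[load].bind(tx, thread_x)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```

This only shows the generic tiling idea. To get anywhere near cuDNN you would also need virtual threads, vectorized shared-memory loads, unrolling, etc., as the recipe does.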

What's your background in CUDA? It would help a lot if your team could contribute a fast (fused) Winograd kernel for CUDA.