merrymercy / tvm-mali

Optimizing Mobile Deep Learning on ARM GPU with TVM
http://tvmlang.org/2018/01/16/opt-mali-gpu.html

about winograd batched MM performance #7

Open janboeye opened 6 years ago

janboeye commented 6 years ago

Hi @merrymercy, I am working on Winograd for CUDA. I found that the batched MM in your Winograd implementation is slow on the NVIDIA architecture. I guess this is because, when C is large, it cannot fully exploit the GPU's parallelism.
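To be concrete, this is roughly the batched MM stage I mean, written with TVM's `te` API (the shapes and layout here are only assumptions for illustration, not your exact code):

```python
from tvm import te

# Assumed sizes: tile size after transform, input channels, output channels, tiles.
alpha, C, K, P = 6, 512, 64, 196

U = te.placeholder((alpha, alpha, K, C), name="U")  # transformed kernel
V = te.placeholder((alpha, alpha, C, P), name="V")  # transformed input tiles
c = te.reduce_axis((0, C), name="c")

# One (K x C) x (C x P) GEMM per (eps, nu) position. The batch axis is only
# alpha * alpha (36 here), and a large C only grows the reduction axis,
# which adds no extra parallelism on the GPU.
M = te.compute(
    (alpha, alpha, K, P),
    lambda eps, nu, k, p: te.sum(U[eps, nu, k, c] * V[eps, nu, c, p], axis=c),
    name="M",
)
```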

Do you have any idea about this part?

Thanks

merrymercy commented 6 years ago

I am also working on CUDA Winograd.

  1. The schedule for Mali GPU cannot be used for NVIDIA GPU. The main difference is the usage of shared memory. You need a totally different schedule for both the transformations and the batched MM. For batched GEMM, see https://github.com/dmlc/tvm/tree/master/topi/recipe/gemm for an example (there is also a sketch after this list).
  2. For NVIDIA GPU, if we want the best performance, we cannot re-layout the data several times like we do on Mali, because some stages become memory bound on NVIDIA GPUs (NVIDIA vs. Mali: peak FLOPS is about 50~200x higher, but memory bandwidth is only about 10x higher). According to the original paper, we should fuse the transforms and the batched GEMM into one kernel.
  3. Actually, I cannot figure out how to fuse them to get the best performance. The open-source code from that paper (the neon library) is in assembly and I cannot read it. For now I only have some preliminary results: for inference, if we do the kernel transformation in advance, our kernel can beat cuDNN's best Winograd when the kernel tensor is large (such as the last few layers of ResNet).
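To illustrate point 1, here is a minimal sketch of what a CUDA schedule for the batched MM could look like, following the shared-memory tiling pattern of the GEMM recipe linked above. All sizes, tile factors, and names are assumptions for illustration; this is not a tuned schedule.

```python
import tvm
from tvm import te

# Illustrative sizes only: batch = alpha*alpha tile positions; the reduction
# axis corresponds to the input-channel dimension C discussed above.
num_batch, num_out, num_tiles, num_in = 36, 64, 196, 512

A = te.placeholder((num_batch, num_out, num_in), name="A")    # transformed kernel
B = te.placeholder((num_batch, num_in, num_tiles), name="B")  # transformed input
k = te.reduce_axis((0, num_in), name="k")
C = te.compute(
    (num_batch, num_out, num_tiles),
    lambda b, y, x: te.sum(A[b, y, k] * B[b, k, x], axis=k),
    name="C",
)

s = te.create_schedule(C.op)
AA = s.cache_read(A, "shared", [C])   # stage operands through shared memory
BB = s.cache_read(B, "shared", [C])
CC = s.cache_write(C, "local")        # accumulate in registers

num_thread = 16
block_x = te.thread_axis("blockIdx.x")
block_y = te.thread_axis("blockIdx.y")
block_z = te.thread_axis("blockIdx.z")
thread_x = te.thread_axis((0, num_thread), "threadIdx.x")
thread_y = te.thread_axis((0, num_thread), "threadIdx.y")

b, y, x = s[C].op.axis
by, yi = s[C].split(y, factor=num_thread)
bx, xi = s[C].split(x, factor=num_thread)
s[C].reorder(b, by, bx, yi, xi)
s[C].bind(b, block_z)
s[C].bind(by, block_y)
s[C].bind(bx, block_x)
s[C].bind(yi, thread_y)
s[C].bind(xi, thread_x)

# Each thread accumulates one output element; A/B tiles are loaded into
# shared memory once per reduction chunk.
s[CC].compute_at(s[C], xi)
ko, ki = s[CC].split(s[CC].op.reduce_axis[0], factor=num_thread)
s[AA].compute_at(s[CC], ko)
s[BB].compute_at(s[CC], ko)

# Cooperative loading: all threads of the block fill the shared tiles.
for load in [AA, BB]:
    fused = s[load].fuse(*s[load].op.axis)
    ty, rest = s[load].split(fused, nparts=num_thread)
    tx, _ = s[load].split(rest, nparts=num_thread)
    s[load].bind(ty, thread_y)
    s[load].bind(tx, thread_x)

print(tvm.lower(s, [A, B, C], simple_mode=True))
```

This only shows the generic tiling idea. To get anywhere near cuDNN you would also need virtual threads, vectorized shared-memory loads, unrolling, etc., as the recipe does.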

What's your background in CUDA? It would help a lot if your team could contribute a fast (fused) Winograd kernel for CUDA.