liuliu / swift-diffusion

BSD 3-Clause "New" or "Revised" License

Why so slow? #22

Closed: iamgeo92 closed this issue 1 year ago

iamgeo92 commented 1 year ago

Maple-diffusion (the MPS implementation) runs at about 1-1.3 s per iteration on an 8 GB M1 Mac, but your implementation takes 90 s for 30 steps (about 3 s per step).

I was wondering what is lacking in this codebase, and how we can get speed similar to the Draw Things app on M1.

Thanks

liuliu commented 1 year ago

Check out this branch: https://github.com/liuliu/swift-diffusion/tree/liu/nhwc

The main thing is to move from the NCHW layout (more efficient on CUDA) to the NHWC layout (more efficient on Apple hardware).
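For readers unfamiliar with the two layouts, here is a minimal sketch of how the same tensor is addressed in each one. It is plain Python for illustration only, not code from this repository; the function names `nchw_offset`, `nhwc_offset`, and `nchw_to_nhwc` are hypothetical. The point it shows is that in NHWC all channels of one pixel sit next to each other in memory, while in NCHW they are a full image plane (H*W elements) apart.

```python
# Illustrative only: these helpers are hypothetical and not part of swift-diffusion.

def nchw_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat index of the same element in a contiguous NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(flat, N, C, H, W):
    """Copy a flat NCHW buffer into a new flat NHWC buffer."""
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nchw_offset(n, c, h, w, C, H, W)
                    dst = nhwc_offset(n, c, h, w, C, H, W)
                    out[dst] = flat[src]
    return out


# A 1x2x2x2 tensor: channel 0 holds 0..3, channel 1 holds 4..7.
N, C, H, W = 1, 2, 2, 2
nchw = list(range(N * C * H * W))
nhwc = nchw_to_nhwc(nchw, N, C, H, W)
print(nhwc)  # [0, 4, 1, 5, 2, 6, 3, 7]

# Distance in memory between two channels of the same pixel:
print(nchw_offset(0, 1, 0, 0, C, H, W) - nchw_offset(0, 0, 0, 0, C, H, W))  # 4 (= H*W)
print(nhwc_offset(0, 1, 0, 0, C, H, W) - nhwc_offset(0, 0, 0, 0, C, H, W))  # 1
```

Which layout is faster in practice depends on the hardware and on the kernels the backend selects, so treat this only as a picture of the memory order, not as a performance model.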

machineko commented 1 year ago

> Main thing is to move from NCHW layout (more efficient on CUDA) to NHWC layout (more efficient on Apple hardware).

For someone reading this in the future CUDA is also faster on NHWC layout 🐎 https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout

liuliu commented 1 year ago

> For someone reading this in the future CUDA is also faster on NHWC layout 🐎 https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout

That's good to know! I should do more testing. My result was probably because I mistakenly used NHWC for the weights layout, which doesn't activate Winograd properly in the conv layer.