super-fun-surf opened 1 month ago
The tricky bit when profiling what happens on GPUs is that the APIs are asynchronous (for both CUDA and Metal), so you see most of the time being spent in the final data transfer, whereas it may be any op performed before it that actually takes most of the time. CUDA has some very nice profiling tools for this, but on Metal it's pretty annoying: we have `device.capture(path)`, but it's only able to handle very small computations.
Ahh, very interesting, I see what you mean. For example, I just enabled intermediary images and the timer shows the denoising dropping from 16 seconds to only 4 seconds, but the GPU-to-CPU transfer is taking 13 seconds. So the whole operation is about 1 second longer, but it's not apparent from the timer what's actually taking the time...
Metal usage right now makes macOS super laggy and completely freezes the system many times during the image generation process. Tested on M1 and M2.
What is the process for determining the problem? It seems like a serious memory issue or something pretty deep, since it hangs the whole computer.
Using the stable diffusion example, running SDXL on CUDA vs Metal: creating the image on an RTX 4000 Ada using CUDA takes about 1 second per step. Creating the image on an M1 with 16 GB of shared memory is about 10x slower at 16 seconds per step. Since the GPU is not fully maxed out on Metal yet, this makes sense; however, there seems to be a bug when transferring the image from the GPU back to the CPU.
On the CUDA machine the transfer takes 0.149 seconds, while on the M1 it takes anywhere from 36 seconds to 400 seconds and completely freezes the host OS.
I made a branch with a timer in place at https://github.com/AIFX-Art/candle/tree/gpu-timing
CUDA:
runs and outputs
Metal:
runs and outputs