huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Metal bug: moving an image from GPU to CPU hangs the whole system. #2369

Open super-fun-surf opened 1 month ago

super-fun-surf commented 1 month ago

Using the stable diffusion example to run SDXL on CUDA vs Metal: creating the image on an RTX 4000 Ada using CUDA takes about 1 second per step. Creating the image on an M1 with 16GB of shared memory is about 10x slower at 16 seconds per step. Since the GPU is not fully saturated on Metal yet, this makes sense; however, there seems to be a bug when transferring the image from the GPU back to the CPU.

On the CUDA machine the transfer takes 0.149 seconds; on the M1 it takes anywhere from 36 to 400 seconds and completely freezes the host OS.

I made a branch with a timer in place at https://github.com/AIFX-Art/candle/tree/gpu-timing
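For reference, the measurement in that branch boils down to a timer around the device-to-CPU copy of the decoded image. A minimal sketch of that kind of timing (the helper name is mine, not taken from the branch) could look like:

```rust
use std::time::Instant;

use candle_core::{Device, Result, Tensor};

/// Hypothetical helper: time how long it takes to move a tensor from the
/// GPU device back to host memory.
fn time_image_to_cpu(image: &Tensor) -> Result<Tensor> {
    let start = Instant::now();
    let cpu_image = image.to_device(&Device::Cpu)?;
    println!("Image to CPU {:.5}s", start.elapsed().as_secs_f64());
    Ok(cpu_image)
}
```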

CUDA:

cargo run --release --features cuda,cudnn --example stable-diffusion -- --sd-version xl --n-steps 8 --width 1024 --height 1024 --use-f16

runs and outputs

Tensor[dims 2, 77, 2048; f16, cuda:0]
Building the autoencoder.
Building the unet.
starting sampling
step 1/8 done, 1.24s
step 2/8 done, 0.68s
step 3/8 done, 1.01s
step 4/8 done, 1.01s
step 5/8 done, 1.01s
step 6/8 done, 1.01s
step 7/8 done, 1.01s
step 8/8 done, 1.01s
Generating the final image for sample 1/1.
Image to CPU 0.14912221s

Metal:

cargo run --release --features metal  --example stable-diffusion -- --sd-version xl --n-steps 8 --width 1024 --height 1024 --use-f16

runs and outputs

Tensor[dims 2, 77, 2048; f16, metal:4294969334]
Building the autoencoder.
Building the unet.
starting sampling
step 1/8 done, 3.68s
step 2/8 done, 14.28s
step 3/8 done, 16.89s
step 4/8 done, 16.22s
step 5/8 done, 16.73s
step 6/8 done, 17.44s
step 7/8 done, 16.68s
step 8/8 done, 16.81s
Generating the final image for sample 1/1.
Image to CPU 46.37571s

LaurentMazare commented 1 month ago

The tricky bit when profiling what happens on GPUs is that the APIs are async (for both CUDA and Metal), so you see most of the time being spent in the final data transfer, whereas it may be any op performed before that which actually takes most of the time. CUDA has some very nice profiling tools for that, but on Metal it's pretty annoying to do; we have a device.capture(path) but it's only able to handle very small computations.
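One way to separate those two costs, assuming the candle version in use exposes `Device::synchronize()`, is to wait for the queued GPU work to finish before starting the timer, so the copy itself is measured in isolation. A minimal sketch:

```rust
use std::time::Instant;

use candle_core::{Device, Result, Tensor};

/// Sketch: flush pending (async) GPU work first, so the timer only measures
/// the actual device-to-host copy rather than every kernel queued before it.
fn time_transfer_only(image: &Tensor, device: &Device) -> Result<Tensor> {
    device.synchronize()?; // wait for previously submitted ops to complete
    let start = Instant::now();
    let cpu_image = image.to_device(&Device::Cpu)?;
    println!("transfer only: {:.3}s", start.elapsed().as_secs_f64());
    Ok(cpu_image)
}
```

With that in place, a slow `to_device` that is really just waiting on earlier kernels should show up as a short copy preceded by a long synchronize.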

super-fun-surf commented 1 month ago

Ahh, very interesting, I see what you mean. For example, I just enabled intermediary images and the timer shows the denoising dropping from 16 seconds to only 4 seconds, but the GPU-to-CPU transfer is taking 13 seconds. So the whole operation is about 1 second longer, but it's not apparent from the timer what's actually taking the time...

super-fun-surf commented 1 month ago

Metal usage right now makes macOS super laggy and completely freezes the system many times during the image generation process. Tested on M1 and M2.

What is the process for determining the problem? It seems like a serious memory issue or something pretty deep, since it hangs the whole computer.
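One concrete starting point, using the capture hook mentioned above, would be to record a small slice of the Metal work and inspect it in Xcode's GPU debugger. A rough sketch, assuming the `Device::Metal` variant and a `capture(path)` method as described earlier (exact signature may differ between candle versions):

```rust
use candle_core::{Device, Result};

/// Sketch: trigger a Metal GPU capture for a small region of the pipeline.
/// The resulting .gputrace can be opened in Xcode's Metal debugger to see
/// which command buffers stall or allocate excessively.
fn start_metal_capture(device: &Device) -> Result<()> {
    if let Device::Metal(metal_device) = device {
        // Keep the captured region small; large captures are reported not to work.
        metal_device.capture("/tmp/candle-sdxl.gputrace")?;
    }
    Ok(())
}
```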