replicate / cog-flux

Cog inference for flux models
https://replicate.com/black-forest-labs/flux-dev
Apache License 2.0

torch.compile ae.decode #25

Open · yorickvP opened 3 weeks ago

yorickvP commented 3 weeks ago

It takes about 80 seconds on my machine to compile this. Makes the decoding step about 50% faster on an A5000 (0.3 s -> 0.2 s); haven't tried an H100 yet.
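
For context, a minimal sketch of the kind of change being benchmarked, assuming `ae` is the loaded Flux autoencoder exposing a `decode` method and `latents` is a latent batch from the sampler (the exact names in this repo may differ):

```python
import torch

# Sketch, not the exact PR diff: wrap the autoencoder's decode
# call in torch.compile so later calls run the optimized graph.
ae.decode = torch.compile(ae.decode)

# The first call pays the one-time compile cost (~80 s reported
# above); subsequent calls with the same input shapes reuse the
# compiled graph and run faster.
with torch.inference_mode():
    image = ae.decode(latents)
```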

yorickvP commented 3 weeks ago

Did some H100 benchmarks.

Benchmark traces were attached as images in the original comment:

- flux-schnell, 1 image, VAE not compiled
- flux-schnell, 4 images, VAE not compiled
- flux-schnell, 4 images, VAE compiled


The VAE speedup seems reproducible: the uncompiled VAE spends a lot of time in nchwToNhwcKernel, while the compiled version manages to avoid it.
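
A kernel-level breakdown like this can be captured with torch.profiler; a minimal sketch, again assuming `ae` and `latents` stand in for the loaded autoencoder and a sample latent batch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one decode pass and aggregate time per CUDA kernel.
with torch.inference_mode():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        ae.decode(latents)

# Layout-conversion kernels such as nchwToNhwcKernel show up in
# this table when the uncompiled VAE is doing NCHW<->NHWC copies.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```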

Separately, I hit a cog bug saying "output streams failed to drain" that crashed the pod instantly, but it seems unrelated to this PR.

jonluca commented 1 week ago

Did you figure out what the "output streams failed to drain" issue was? I'm seeing it in prod with our cog deploy as well.

yorickvP commented 1 week ago

@jonluca as I understand it, it was a regression in cog and should be fixed when building with 0.9.25 or later. It was caused by cog replacing stdout/stderr during predictions but not during setup, so forked processes would attempt to write to the original stdout/stderr. It should be fixed by https://github.com/replicate/cog/pull/1969, but let me know if it's not!
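
A hypothetical illustration of that failure mode (not cog's actual code): a child process forked while sys.stdout is still the original stream keeps writing to what it inherited, so swapping sys.stdout in the parent afterwards has no effect on it.

```python
# Illustration only, assuming the fork start method: the child
# inherits the parent's sys.stdout binding at fork time, so a
# later parent-side swap does not redirect the child's output.
import io
import multiprocessing as mp
import sys

def child():
    # Writes go to the stream inherited at fork time, even if
    # the parent has since replaced its own sys.stdout.
    print("child writes to the stream inherited at fork")

if __name__ == "__main__":
    p = mp.Process(target=child)
    p.start()                   # forked during "setup", before any redirect
    sys.stdout = io.StringIO()  # parent-only swap; the child is unaffected
    p.join()
```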