replicate / cog-flux

Cog inference for flux models
https://replicate.com/black-forest-labs/flux-dev
Apache License 2.0

torch.compile ae.decode #25

Open · yorickvP opened 3 weeks ago

yorickvP commented 3 weeks ago

It takes about 80 seconds on my machine to compile this. Makes the decoding step about 50% faster on an A5000 (0.3 s -> 0.2 s); haven't tried an H100 yet.
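
For context, a minimal sketch of the kind of change being benchmarked, assuming `ae` is the loaded Flux autoencoder exposing a `decode` method and `latents` is a latent batch from the sampler (the exact names in this repo may differ):

```python
import torch

# Sketch, not the exact PR diff: wrap the autoencoder's decode
# call in torch.compile so later calls run the optimized graph.
ae.decode = torch.compile(ae.decode)

# The first call pays the one-time compile cost (~80 s reported
# above); subsequent calls with the same input shapes reuse the
# compiled graph and run faster.
with torch.inference_mode():
    image = ae.decode(latents)
```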

yorickvP commented 3 weeks ago

Did some H100 benchmarks.

Benchmark traces were attached as images in the original comment:

- flux-schnell, 1 image, VAE not compiled
- flux-schnell, 4 images, VAE not compiled
- flux-schnell, 4 images, VAE compiled


The VAE speedup seems reproducible: the uncompiled VAE spends a lot of time in nchwToNhwcKernel, while the compiled version manages to avoid it.
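
A kernel-level breakdown like this can be captured with torch.profiler; a minimal sketch, again assuming `ae` and `latents` stand in for the loaded autoencoder and a sample latent batch:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one decode pass and aggregate time per CUDA kernel.
with torch.inference_mode():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        ae.decode(latents)

# Layout-conversion kernels such as nchwToNhwcKernel show up in
# this table when the uncompiled VAE is doing NCHW<->NHWC copies.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```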

Separately, I hit a cog bug saying "output streams failed to drain" that crashed the pod instantly, but it seems unrelated to this PR.

jonluca commented 1 week ago

Did you figure out what the "output streams failed to drain" issue was? I'm seeing it in prod with our cog deploy as well.

yorickvP commented 1 week ago

@jonluca as I understand it, it was a regression in cog and should be fixed when building with 0.9.25 or later. It was caused by cog replacing stdout/stderr during predictions but not during setup, so forked processes would attempt to write to the original stdout/stderr. It should be fixed by https://github.com/replicate/cog/pull/1969, but let me know if it's not!
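
A hypothetical illustration of that failure mode (not cog's actual code): a child process forked while sys.stdout is still the original stream keeps writing to what it inherited, so swapping sys.stdout in the parent afterwards has no effect on it.

```python
# Illustration only, assuming the fork start method: the child
# inherits the parent's sys.stdout binding at fork time, so a
# later parent-side swap does not redirect the child's output.
import io
import multiprocessing as mp
import sys

def child():
    # Writes go to the stream inherited at fork time, even if
    # the parent has since replaced its own sys.stdout.
    print("child writes to the stream inherited at fork")

if __name__ == "__main__":
    p = mp.Process(target=child)
    p.start()                   # forked during "setup", before any redirect
    sys.stdout = io.StringIO()  # parent-only swap; the child is unaffected
    p.join()
```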