Open yorickvP opened 3 weeks ago
Did some H100 benchmarks.
The VAE speed seems reproducible, where the uncompiled VAE spends a lot of time in nchwToNhwcKernel while the compiled version manages to avoid it.
At the same time, I had a cog bug saying "output streams failed to drain", crashing the pod instantly, but this seems unrelated to my PR.
Did you figure out what the "output streams failed to drain" issue was? I'm seeing that in prod with our cog deploy as well.
@jonluca as I understand it, it was a regression in cog and should be fixed when building with 0.9.25 and later. It was caused by cog replacing stdout/stderr during predictions, but not during setup, causing forked processes to attempt to write to the original stdout/stderr. Should be fixed in https://github.com/replicate/cog/pull/1969 but let me know if it's not!
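For anyone trying to picture the failure mode described above, here is a minimal sketch of one general way a child process can end up writing to the original stdout after the parent has replaced it. It is an illustration only and not cog's actual implementation: it assumes the replacement happens at the Python object level (`contextlib.redirect_stdout`), and the `PARENT` script and `run_demo` helper are hypothetical names made up for this example.

```python
import subprocess
import sys
import textwrap

# Hypothetical "parent" script: it swaps sys.stdout for an in-memory buffer,
# then starts a child process without passing an explicit stdout.
PARENT = textwrap.dedent('''
    import contextlib, io, subprocess, sys

    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        print("parent line")  # goes to the replacement buffer
        # The child inherits the OS-level stdout, not the Python-level replacement.
        subprocess.run([sys.executable, "-c", "print('child line')"], check=True)

    # Report what the replacement buffer saw, via stderr.
    sys.stderr.write("captured=" + repr(captured.getvalue()))
''')


def run_demo():
    """Run the parent script and return (original stdout contents, stderr contents)."""
    proc = subprocess.run(
        [sys.executable, "-c", PARENT],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout, proc.stderr


if __name__ == "__main__":
    original_stdout, report = run_demo()
    print("seen on the original stdout:", repr(original_stdout))
    print("seen by the replacement:", report)
```

In this sketch the parent's own output lands in the replacement buffer, while the child's output goes to the stream the parent started with. If nothing is reading or draining that original stream, output can pile up there, which is at least consistent with the symptom discussed in this thread.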
It takes about 80 seconds on my machine to compile this. Makes the encoding step about 50% faster on A5000 (0.3 -> 0.2s), haven't tried H100.
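A quick back-of-the-envelope check on those numbers, as a sketch. It assumes the compile cost is paid once per process and that the 0.3 s and 0.2 s per-encode timings stay constant; the helper names are hypothetical. Times are in integer milliseconds to avoid floating-point rounding.

```python
import math


def speedup(before_ms: int, after_ms: int) -> float:
    """Throughput ratio: how many times faster one call is after compiling."""
    return before_ms / after_ms


def time_saved_fraction(before_ms: int, after_ms: int) -> float:
    """Fraction of wall-clock time saved per call."""
    return (before_ms - after_ms) / before_ms


def break_even_calls(compile_ms: int, before_ms: int, after_ms: int) -> int:
    """Number of calls after which the one-time compile cost is repaid."""
    return math.ceil(compile_ms / (before_ms - after_ms))


if __name__ == "__main__":
    # Figures from the comment above: 80 s compile, 0.3 s -> 0.2 s per encode on A5000.
    print(speedup(300, 200))                   # 1.5
    print(time_saved_fraction(300, 200))       # roughly one third
    print(break_even_calls(80_000, 300, 200))  # 800
```

So "50% faster" here means 1.5x throughput, or about a third less time per encode, and under these assumptions the 80 second compile pays for itself after roughly 800 encodes in the same process.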