A 4X latency improvement is massive. You don't need any extra setup to achieve this, right? If so, it would be nice to document it as thoroughly as possible.
Yes, a pure, native PyTorch environment. The dynamic int8 quantization optimization has a functionality issue on CPU, which we are still working on. We may share more experiment results once all the optimizations can be applied.
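For context, the deferred step would look roughly like the following. This is a minimal sketch assuming torchao's `quantize_` entry point and its `int8_dynamic_activation_int8_weight` config; the exact API used by diffusion-fast may differ, and the checkpoint name is a placeholder:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

# Load the pipeline in bfloat16 (placeholder checkpoint for illustration).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
)

# Apply dynamic int8 quantization to the UNet's linear layers. This is the
# optimization reported in this thread as having a functionality issue on CPU,
# so it is excluded from the CPU results for now.
quantize_(pipe.unet, int8_dynamic_activation_int8_weight())
```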
Okay, then we'd want to make that explicitly clear in the README. Otherwise, it's still incomplete in my opinion.
Can you be more specific about what you would like me to add to the README?
The current changes are fine, except that we're not specifying what you told me here: https://github.com/huggingface/diffusion-fast/pull/13#issuecomment-2112234531
So this makes it incomplete.
I have written that these optimizations (BFloat16, SDPA, torch.compile, combining the q, k, v projections) can run on CPU platforms, and have not included the dynamic int8 quantization optimization here.
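For illustration, the CPU-runnable subset could be enabled roughly as follows. This is a minimal sketch assuming a diffusers SDXL pipeline; the checkpoint name, compile mode, and prompt are placeholders, and SDPA is the default attention backend in recent diffusers releases, so it needs no explicit setup:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline in bfloat16 on CPU (SDPA is used by default).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
)
pipe.to("cpu")

# Combine the q, k, v projections into a single projection.
pipe.fuse_qkv_projections()

# Compile the UNet (the main compute hotspot) with torch.compile.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

image = pipe("a photo of an astronaut riding a horse").images[0]
```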
Sorry, that was my oversight. Thanks!