Closed: ghost closed this issue 1 year ago
Are the 6-bit quantized models just for reducing the space taken on disk, or do they also lower GPU RAM usage on M1? Are there any speed improvements if we use 6-bit SDXL?
Yeah, it reduces space on disk and, in the latest versions, RAM usage (not released yet, but already in s4nnc). We were unable to harvest any speed improvements so far, though. The user-reported speed improvements are specific to SDXL: because quantization reduces RAM usage, an 8GiB MacBook no longer swaps, which significantly reduces latency. It has no impact on iPad thanks to a better RAM management scheme there.
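The swap argument above can be checked with back-of-envelope arithmetic. A sketch, assuming roughly 2.6B UNet parameters for SDXL (an illustrative figure; the exact count depends on the variant) and ignoring the small palette/metadata overhead of the quantized format:

```python
# Rough storage math for an SDXL-sized UNet.
# The 2.6e9 parameter count is an assumption for illustration only.
params = 2.6e9

fp16_bytes = params * 2       # 16 bits per weight
q6_bytes = params * 6 / 8     # 6 bits per weight, palette overhead ignored

print(f"fp16 : {fp16_bytes / 2**30:.2f} GiB")   # ~4.84 GiB
print(f"6-bit: {q6_bytes / 2**30:.2f} GiB")     # ~1.82 GiB
print(f"saved: {1 - q6_bytes / fp16_bytes:.1%}")  # 62.5%
```

At fp16 the weights alone approach 5 GiB, which together with activations and the OS easily pushes an 8GiB machine into swap; at 6 bits they drop below 2 GiB, which is consistent with the reported latency win on 8GiB MacBooks.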
MFA was merged a while ago. The merge happened in the ccv / nnc repo.
How do I use MFA with the UNet? Is there an example?
Hey,
I know that we can use 6-bit quantization with nnc. For SDXL, what is the performance improvement? Are the 6-bit quantized models just for reducing the space taken on disk, or do they also lower GPU RAM usage on M1? Are there any speed improvements if we use 6-bit SDXL?
Also, has Metal FlashAttention been merged?
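For context on what sub-8-bit quantization of the weights looks like, here is a minimal palette (lookup-table) sketch. This is an illustration only, not the actual nnc/s4nnc scheme: the 64-entry palette, the quantile-based palette construction, and the array sizes are all assumptions made for the example.

```python
import numpy as np

# Illustrative 6-bit palette quantization (NOT the actual nnc implementation):
# each weight is replaced by a 6-bit index into a 64-entry lookup table.
rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)

# Build a 64-entry palette from evenly spaced quantiles of the weights.
palette = np.quantile(weights, np.linspace(0.0, 1.0, 64)).astype(np.float32)

# Quantize: nearest palette entry per weight (indices fit in 6 bits, 0..63).
indices = np.abs(weights[:, None] - palette[None, :]).argmin(axis=1)

# Dequantize at load time; the quality cost is the reconstruction error.
restored = palette[indices]
err = np.abs(weights - restored).mean()
print(f"max index: {indices.max()}, mean abs error: {err:.4f}")
```

Storing 6-bit indices instead of 16-bit floats is where the disk and (once the weights stay quantized in memory) RAM savings come from; the speed question is separate, since the weights still have to be dequantized for the matmuls.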