liuliu / swift-diffusion

BSD 3-Clause "New" or "Revised" License

Some questions regarding the 6-8 bit models? #53

Closed ghost closed 9 months ago

ghost commented 9 months ago

I had a few questions:

I am getting a huge difference in performance when I load sd_xl_base_1.0_f16.ckpt vs sd_xl_base_1.0_q6p_q8p.ckpt. The codec in openStore is not set in either case. The memory used is also the same in both cases. But when I load sd_xl_base_1.0_q6p_q8p, it's much faster. Why is this the case? I thought that if I don't pass any codec it would just cast all the weights to fp16 format.

liuliu commented 9 months ago

Not providing a codec will not cast to fp16 if the weights are stored with any of these codecs (ezm7, qXp). Only fp32 / fp16 get this seamless conversion. What happens is it will try to load as fp16 / fp32, most likely fail, and at that point initialize the weights with the built-in initialization scheme (I believe it is either He-init or Xavier-init).

It feels faster probably because these init schemes result in NaNs for deeper networks and then the shader takes a shortcut?

We should introduce a "strict" mode to make this easier to catch (so we can notice when some of these parameters are not loaded). All of this originates from s4nnc being a training framework in the beginning, where it is common to load part of the weights and initialize the rest.
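For reference, a rough sketch of what passing the codecs explicitly might look like when opening the store. This is a hedged illustration, not verified against the current s4nnc source: the module name `NNC`, the codec case names (`.q6p`, `.q8p`, `.ezm7`), and the `read(_:model:codec:)` signature are assumptions inferred from this thread, and `unet` stands in for an already-constructed model.

```swift
import NNC // s4nnc (assumed module name)

let graph = DynamicGraph()
graph.openStore("sd_xl_base_1.0_q6p_q8p.ckpt") { store in
  // Without a codec list, the q6p / q8p tensors fail to decode and the
  // model silently falls back to its built-in initialization (the
  // NaN-speedup effect described above).
  // Listing the codecs lets the store decompress the weights on load.
  store.read("unet", model: unet, codec: [.q6p, .q8p, .ezm7])
}
```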

ghost commented 9 months ago

Thanks, yes this seems to be the case.

If I want to use `return .final(tensor)` to load 6-bit weights, how do I go about it? I am assuming the tensor has to be on CPU and already in 6-bit form.
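For context, `.final(...)` here refers to a case returned from s4nnc's per-tensor model reader closure, which lets you substitute your own tensor for a named parameter during loading. The sketch below shows the general shape of that usage; the closure parameters, the `.final` / `.continue` case names, and the `myPreparedTensors` lookup are all assumptions for illustration, not a verified API:

```swift
graph.openStore("sd_xl_base_1.0_q6p_q8p.ckpt") { store in
  store.read("unet", model: unet) { name, dataType, format, shape in
    // If we have a tensor prepared for this parameter (e.g. dequantized
    // on CPU beforehand), hand it to the model directly.
    if let tensor = myPreparedTensors[name] { // hypothetical lookup
      return .final(tensor)
    }
    // Otherwise defer to the store's default loading path.
    return .continue(name)
  }
}
```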