hatchetProject / QuEST

QuEST: Efficient Finetuning for Low-bit Diffusion Models

Code looks like QKMatmul and SMVMatmul calculate with full-precision matrix multiplication. #11

Closed seung-hoon-lee closed 3 weeks ago

seung-hoon-lee commented 3 weeks ago

First of all, thank you for your awesome work and the open-source code. However, I think there is some confusion about the fix you mentioned in issue #10:

"commenting line 367~371 in qdiff/quant_block.py when constructing the model (I will do this modification in the repo)"

With this modification, it looks like the matrix multiplications for QKMatmul and SMVMatmul are computed in full precision: after the code change, no activation quantizer is attached to QKMatmul and SMVMatmul.

[screenshot: the modified code, with no activation quantizer for QKMatmul/SMVMatmul]
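To illustrate what gets lost, here is a minimal sketch of a quantized matmul wrapper of the kind QKMatmul/SMVMatmul are meant to be (the names and quantizer details are illustrative, not the repo's exact code); with the quantizers removed, the forward pass collapses to a plain full-precision torch.matmul:

```python
import torch
import torch.nn as nn


class UniformActQuantizer(nn.Module):
    """Illustrative uniform activation fake-quantizer (not the repo's exact class)."""

    def __init__(self, n_bits: int = 8):
        super().__init__()
        self.n_bits = n_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Symmetric per-tensor fake quantization: quantize, then dequantize.
        qmax = 2 ** (self.n_bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


class QuantMatMul(nn.Module):
    """Wraps a matmul so both operands pass through activation quantizers.

    With use_act_quant=False (or with the module never registered as a
    quantized 'special' block), this collapses to a plain full-precision matmul.
    """

    def __init__(self, n_bits: int = 8, use_act_quant: bool = True):
        super().__init__()
        self.use_act_quant = use_act_quant
        self.quant_a = UniformActQuantizer(n_bits)
        self.quant_b = UniformActQuantizer(n_bits)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        if self.use_act_quant:
            a, b = self.quant_a(a), self.quant_b(b)
        return torch.matmul(a, b)
```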

I believe using the QKMatmul and SMVMatmul classes from the AttentionBlock is the right approach, since the attention operation is resource-intensive. Without the QKMatmul and SMVMatmul that you mentioned in #10, the QKVAttentionLegacy class from /ldm/modules/diffusionmodules/openaimodel.py computes the attention multiplications in full precision.
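Concretely, the forward pass of QKVAttentionLegacy is roughly the following (paraphrased from openaimodel.py; the point is that both einsum matmuls are plain full-precision ops with no quantizer hook anywhere):

```python
import math
import torch
import torch.nn as nn


class QKVAttentionLegacy(nn.Module):
    """Paraphrase of the LDM openaimodel.py attention module."""

    def __init__(self, n_heads: int):
        super().__init__()
        self.n_heads = n_heads

    def forward(self, qkv: torch.Tensor) -> torch.Tensor:
        bs, width, length = qkv.shape
        ch = width // (3 * self.n_heads)
        q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1)
        scale = 1 / math.sqrt(math.sqrt(ch))
        # qk-matmul: full precision, no quantizer in the path
        weight = torch.einsum("bct,bcs->bts", q * scale, k * scale)
        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
        # smv-matmul: full precision as well
        a = torch.einsum("bts,bcs->bct", weight, v)
        return a.reshape(bs, -1, length)
```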

Please let me know if I am misunderstanding anything.

hatchetProject commented 3 weeks ago

Hi, the attention matrix multiplication quantization is done via the QuantBasicTransformerBlock module, which includes the qk-matmul and smv-matmul operations (I commented those lines out because the modules exist but do not serve a useful function, which may cause confusion). Please refer to line 232 of qdiff/quant_block.py for more details.
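Roughly, the attention inside a QuantBasicTransformerBlock-style module does the following (an illustrative, self-contained sketch, not the actual code at line 232): both the qk-matmul and the smv-matmul see fake-quantized operands.

```python
import torch
import torch.nn as nn


def fake_quant(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    # Illustrative symmetric per-tensor fake quantization (quantize-dequantize).
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


class QuantSelfAttention(nn.Module):
    """Sketch of transformer-block attention with quantized matmul operands:
    qk-matmul is q @ k^T, smv-matmul is softmax(qk) @ v."""

    def __init__(self, dim: int, n_heads: int, n_bits: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.n_bits = n_bits
        self.scale = (dim // n_heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.n_heads, -1).transpose(1, 2) for y in (q, k, v))
        # qk-matmul with quantized operands
        attn = torch.matmul(fake_quant(q * self.scale, self.n_bits),
                            fake_quant(k.transpose(-2, -1), self.n_bits))
        attn = attn.softmax(dim=-1)
        # smv-matmul with quantized operands
        out = torch.matmul(fake_quant(attn, self.n_bits),
                           fake_quant(v, self.n_bits))
        return out.transpose(1, 2).reshape(b, t, d)
```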

seung-hoon-lee commented 3 weeks ago

Thank you for your reply.

I noticed that the quantized Stable Diffusion model does include the QuantBasicTransformerBlock module, which contains the qk-matmul and smv-matmul, so the SD model looks fine.

However, the LDM models (LDM Church, Bedroom) do not appear to use QuantBasicTransformerBlock, so with the commented-out code, the attention block structure of the quantized LDM model looks as follows:

[screenshot: attention block structure of the quantized LDM model]

In this structure, the qkv_matmul and smv_matmul use the QKVAttentionLegacy class from /ldm/modules/diffusionmodules/openaimodel.py, which performs full-precision matrix multiplication.

In summary, commenting out those lines does not seem to be an issue for the SD model, but it appears to be problematic for the LDM models.

Please let me know if I am misunderstanding anything.

hatchetProject commented 3 weeks ago

I see, thanks for pointing it out! The solution I provided in issue #10 was wrong; I will clarify the mistake there and provide another one.

The original implementation (before commenting out the lines in get_specials) should be fine. Thanks again!
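For anyone landing here later, the takeaway in rough pseudocode: the matmul specials need to stay registered in get_specials so the LDM AttentionBlock matmuls get quantized wrappers. The class names below follow the q-diffusion lineage and are illustrative; the placeholder definitions only make the sketch self-contained.

```python
import torch.nn as nn

# Placeholder stand-ins so the sketch runs; the real classes live in the
# repo (qdiff/quant_block.py and the modified openaimodel.py).
class BasicTransformerBlock(nn.Module): pass
class QuantBasicTransformerBlock(nn.Module): pass
class QKMatMul(nn.Module): pass
class QuantQKMatMul(nn.Module): pass
class SMVMatMul(nn.Module): pass
class QuantSMVMatMul(nn.Module): pass


def get_specials(quant_act: bool = False) -> dict:
    specials = {
        # SD path: attention matmuls are already quantized inside this
        # block, so the SD model is unaffected either way.
        BasicTransformerBlock: QuantBasicTransformerBlock,
    }
    if quant_act:
        # LDM path: these entries are what got commented out. Without them,
        # AttentionBlock falls back to full-precision QKVAttentionLegacy.
        specials[QKMatMul] = QuantQKMatMul
        specials[SMVMatMul] = QuantSMVMatMul
    return specials
```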