Open kazunator opened 2 days ago
This code also results in high latency, which made me and @zucchini-nlp suspect it's a quanto issue:
```python
from quanto import AffineQuantizer, MaxOptimizer, qint2, qint4
import time
import torch

dummy_tensor_inputs = torch.randn(1, 32, 10_000, 128).to("cuda")
optimizer = MaxOptimizer()
qtype = qint4
q_group_size = 64
axis = 0

# quantize once per layer
for _ in range(16):
    scale, zeropoint = optimizer(dummy_tensor_inputs, qtype.bits, axis, q_group_size)
    qtensor = AffineQuantizer.apply(dummy_tensor_inputs, qtype, axis, q_group_size, scale, zeropoint)

start = time.perf_counter()
for _ in range(16 * 20):
    dequant_tensor = qtensor.dequantize()
end = time.perf_counter()
print(f"Time taken: {(end - start):.2f} seconds")
```
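One caveat with the snippet above: CUDA kernels are launched asynchronously, so `time.perf_counter()` can return before the queued `dequantize` kernels have actually finished. A hedged variant of the timing loop (the `timed` helper and the stand-in matmul workload are mine; synchronization is skipped on CPU-only machines):

```python
import time
import torch

def timed(fn, iters: int) -> float:
    """Wall-clock a callable, synchronizing the GPU before and after
    so queued kernels are included in the measurement (no-op on CPU)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start

# Stand-in workload; swap in qtensor.dequantize to time the quanto path.
x = torch.randn(256, 256)
elapsed = timed(lambda: x @ x, iters=10)
print(f"Time taken: {elapsed:.2f} seconds")
```

Without the synchronize calls, the measured interval only covers kernel launch overhead, which can make a slow GPU path look deceptively fast (or lump earlier queued work into a later timer).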
cc @dacorvo can you please help us understand why quanto is slower for the user, but not for me on a DGX machine?
@zucchini-nlp I wanted to ask whether the quanto on your DGX machine is an install from a while ago, or a fresh pip install into a virtual environment? Maybe this is a new issue?
It is quanto==0.2.0
Okay, I found where the issue is. When I put `.to(model.device)` in my code, I expected that to resolve to just cuda. Apparently that's not the case. When I explicitly set it as `.to("cuda")`, it generates in a reasonable 16 seconds.
I am still confused that the second snippet took 80 seconds even with `.to("cuda")`, so that remains a big question mark, but my issue has been fixed.
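For context, `model.device` in transformers reports the device of the model's first parameter, so with accelerate-style `device_map` placements a model can span several devices while `.to(model.device)` moves inputs to only one of them. A minimal sketch to inspect where parameters actually live (the `devices_of` helper is hypothetical, and the two-layer model is a CPU stand-in for a sharded checkpoint):

```python
import torch

def devices_of(module: torch.nn.Module) -> set:
    """Collect every device actually used by a module's parameters."""
    return {p.device for p in module.parameters()}

# Stand-in model; with device_map="auto", the layers of a real model
# could land on a mix of cuda:0, cuda:1, and cpu.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 4))
print(devices_of(model))
```

If `devices_of(model)` returns more than one device, moving inputs with `.to(model.device)` only matches one shard, which is consistent with the surprise described above.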
I also tested it with llama 70b, weights quantized to int4, and it took 180 seconds to generate with an 80k context window. I know that combining weight quantization with KV-cache quantization makes generation slower, but I didn't expect it to be this slow.
@zucchini-nlp would love it if you could run the same test on your end and see if you get a similar generation time.
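For anyone reproducing the KV-cache side of this, transformers exposes quantized caching through `generate` with `cache_implementation="quantized"` and a quanto-backed `cache_config`. A sketch under those assumptions (the model id, prompt, and token budget are placeholders rather than the 70b / 80k setup from this report, and the GPU path is gated so the config alone can be checked on any machine):

```python
import torch

# KV-cache quantization settings for the quanto backend; the keys follow
# transformers' QuantizedCacheConfig, and nbits=4 matches the int4 setup above.
cache_config = {"backend": "quanto", "nbits": 4}

if torch.cuda.is_available():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint, not the 70b one
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        cache_implementation="quantized",
        cache_config=cache_config,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```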
System Info
transformers version: 4.46.0.dev0

Who can help?
@zucchini-nlp

Information

Tasks
- examples folder (such as GLUE/SQuAD, ...)

Reproduction
Pip installations:
Code:
Expected behavior
This code should result in a generation time of around 16 seconds, but it takes 80 to 100 seconds.