huggingface / transformers


Quanto Quantized KV cache results in very high latency #34096

Open · kazunator opened this issue 2 days ago

kazunator commented 2 days ago

System Info

Who can help?

@zucchini-nlp

Information

Tasks

Reproduction

Pip installations:

!pip install -q git+https://github.com/huggingface/transformers
!pip install datasets accelerate 
!pip install -q flash-attn --no-build-isolation
!pip install quanto

Code:

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import time
import psutil
import gc

def get_gpu_memory():
    return torch.cuda.memory_allocated() / 1024**2  # Convert to MB

def get_ram_usage():
    return psutil.Process().memory_info().rss / 1024**2  # Convert to MB
# Memory usage before model load and generation
gpu_memory_before = get_gpu_memory()
ram_before = get_ram_usage()

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-Instruct", torch_dtype=torch.float16, device_map="auto")

tokenizer.pad_token_id = tokenizer.eos_token_id
dataset = load_dataset('THUDM/LongBench', "samsum", split='test')
very_long_context = " ".join(dataset["context"])
inputs = tokenizer(very_long_context, max_length=10000, truncation="only_first", return_tensors="pt").to(model.device)

generation_kwargs = {"do_sample": False, "temperature": 1.0, "top_p": 1.0, "max_new_tokens": 20, "min_new_tokens": 20, "cache_implementation": "quantized"}

# Time the generation
start_time = time.time()

out_fp16 = model.generate(**inputs, **generation_kwargs)
generated_text = tokenizer.batch_decode(out_fp16)

end_time = time.time()

# Memory usage after generation
gpu_memory_after = get_gpu_memory()
ram_after = get_ram_usage()

# Calculate differences
gpu_memory_used = gpu_memory_after - gpu_memory_before
ram_used = ram_after - ram_before
time_taken = end_time - start_time

print(f"Generated text: {generated_text}")
print(f"Time taken: {time_taken:.2f} seconds")
print(f"GPU memory used: {gpu_memory_used:.2f} MB")
print(f"RAM used: {ram_used:.2f} MB")

Expected behavior

This code should result in a generation time of around 16 seconds, but it takes 80 to 100 seconds.
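
For a baseline, the same generation without the quantized cache can be timed like this (a sketch that reuses model, tokenizer and inputs from the reproduction above; the only difference is dropping cache_implementation):

# Sketch: same generation with the default (non-quantized) cache for a baseline timing.
# Assumes `model`, `tokenizer` and `inputs` from the reproduction snippet above.
import time

baseline_kwargs = {"do_sample": False, "max_new_tokens": 20, "min_new_tokens": 20}

start_time = time.time()
out_baseline = model.generate(**inputs, **baseline_kwargs)
baseline_text = tokenizer.batch_decode(out_baseline)
print(f"Baseline (fp16 cache) time: {time.time() - start_time:.2f} seconds")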

kazunator commented 2 days ago

This code also results in high latency, which made me and @zucchini-nlp suspect that it's a quanto issue:

from quanto import AffineQuantizer, MaxOptimizer, qint2, qint4
import time
import torch

dummy_tensor_inputs = torch.randn(1, 32, 10_000, 128).to("cuda")
optimizer = MaxOptimizer()
qtype = qint4
q_group_size = 64
axis = 0

# quantize once per layer
for _ in range(16):
    scale, zeropoint = optimizer(dummy_tensor_inputs, qtype.bits, axis, q_group_size)
    qtensor = AffineQuantizer.apply(dummy_tensor_inputs, qtype, axis, q_group_size, scale, zeropoint)

start = time.perf_counter()
for _ in range(16 * 20):  # simulate dequantizing the cache for 16 layers over 20 generated tokens
    dequant_tensor = qtensor.dequantize()

end = time.perf_counter()
print(f"Time taken: {(end - start):.2f} seconds")

zucchini-nlp commented 2 days ago

cc @dacorvo can you please help us understand why quanto is slower for the user, but not for me on a DGX machine?

kazunator commented 2 days ago

@zucchini-nlp I wanted to ask whether you're using a quanto version you installed a while ago on your DGX machine, or whether you're pip installing it fresh in a virtual environment? Maybe this is a new issue?
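
Something like this would make it easy to compare the exact versions on both sides (a sketch; the package list is just a guess at what's relevant here):

# Sketch: print the installed versions relevant to this issue on each machine.
import importlib.metadata as md
import torch

for pkg in ("transformers", "quanto", "optimum-quanto", "accelerate", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")

print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")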

zucchini-nlp commented 2 days ago

It is quanto==0.2.0

kazunator commented 2 days ago

Okay, I found where the issue is. When I used .to(model.device) in my code, I expected the device to be just cuda. Apparently that's not the case. When I explicitly set it to .to("cuda"), it generates in a reasonable 16 seconds.

I am still confused that the second snippet took 80 seconds even though it already uses .to("cuda"). So that's still a big question mark, but my issue has been fixed.
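
For anyone hitting the same thing, a quick way to see where the inputs actually end up is to print the device placement before generating (a sketch; assumes a single-GPU setup like the one in the reproduction above):

# Sketch: inspect device placement before generating.
# With device_map="auto", model.device may not be what you expect,
# and accelerate may have placed or offloaded modules elsewhere.
print("model.device:", model.device)
print("hf_device_map:", getattr(model, "hf_device_map", None))
print("input_ids device:", inputs["input_ids"].device)

# Moving the inputs explicitly to the GPU is what fixed it for me.
inputs = inputs.to("cuda")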

I also tested it with Llama 70B weight-quantized to int4, and it took 180 seconds to generate with an 80k context window. I know that combining weight quantization with KV cache quantization makes things slow, but I didn't expect it to be this slow.
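
Roughly the setup I used for that test, for reference (a sketch; the exact checkpoint and the bitsandbytes int4 config here are assumptions, so swap in whatever you have locally):

# Sketch: 4-bit weight quantization (bitsandbytes) combined with the quantized KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # assumed checkpoint; any int4 Llama 70B works

# 4-bit weight quantization via bitsandbytes (one way to get int4 weights)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# ~80k-token prompt, reusing `very_long_context` from the first snippet
inputs = tokenizer(very_long_context, max_length=80_000, truncation=True, return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=20, do_sample=False, cache_implementation="quantized")
print(tokenizer.batch_decode(out)[0][-500:])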

@zucchini-nlp it would be great if you could run the same test on your end and see whether you get a similar generation time.