mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

Speed benchmark #11

Closed vince62s closed 5 months ago

vince62s commented 8 months ago

Hello guys,

Congrats on the wonderful package / paper.

I am just curious, before implementing this in OpenNMT-py: do you have some speed benchmarks somewhere, in tok/sec, against other methods for a given inference framework, whether it's pure PyTorch, vLLM, or anything else?

Cheers.

NB: I have mixed feelings about torch.compile because the first pass/load is often very slow.

mobicham commented 8 months ago

Hi @vince62s , thanks for your message! I did some benchmarks with the Hugging Face Llama2-7B model and the torch.compile backend for the forward pass with seq_len=1024 (no caching). HQQ is actually doing quite well for a pure Pytorch implementation, as you can see below:

| GPU       | Configuration            | Execution Time (sec) |
|-----------|--------------------------|----------------------|
| Titan RTX | BNB-NF4                  | 0.288                |
| Titan RTX | HQQ                      | 0.304                |
| A100      | HQQ                      | 0.150                |
| A100      | HQQ (scale/zero on cpu)  | 0.175                |
| A100      | QUIP# 4-bit              | 0.253                |
| A100      | QUIP# 2-bit              | 0.353                |

Training with LoRA was quite close to BNB (35 minutes vs. 32 minutes), though that comparison uses fp4, not nf4; fp4 is slightly faster but worse in quality.

Regarding the torch.compile note, you can do a warm-start: `with torch.no_grad(): out = model(torch.ones((1, seq_len), dtype=torch.int32, device='cuda'))`. I think there's even a way to cache the generated kernels for different GPU architectures, so that they don't have to be generated on the fly. We are still exploring which parts should be compiled, because we need kernels that work for both the forward and the backward pass.
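
For illustration, a minimal warm-start sketch along those lines (a sketch only, assuming `model` is a Hugging Face causal LM already loaded on the GPU; the int32 dummy input mirrors the snippet above):

```python
import torch

seq_len = 1024

# Compile the forward pass (PyTorch >= 2.0); kernels are generated on first use.
model = torch.compile(model)

# Warm-start: one dummy forward pass pays the compilation cost up front,
# so the first real request is not slowed down. Inductor can also cache
# compiled artifacts on disk (e.g. via TORCHINDUCTOR_CACHE_DIR).
with torch.no_grad():
    out = model(torch.ones((1, seq_len), dtype=torch.int32, device='cuda'))
```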

VLLM is about 5x faster, which shows that the architecture plays a major role (merging layers, better layer implementations, etc.).

There are some Triton kernels for HQQ available, unfortunately they seem to be slower on older GPUs.

We are currently discussing ways to make it faster here: https://github.com/mobiusml/hqq/issues/8

vince62s commented 8 months ago

No inference benchmark using HF, for instance vs. GPTQ/AWQ? BNB is really slow, so I hope inference with HQQ is not as slow as BNB.

mobicham commented 8 months ago

I don't have numbers for GPTQ right now, but I can take a look at the Hugging Face version. If their GPTQ Triton/CUDA kernels are actually that fast, that's great news, because they could be reused for HQQ: the dequantization step is similar.
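
To illustrate why the kernels could be shared, the common pattern is roughly the following (a purely conceptual sketch in plain PyTorch, not the actual GPTQ/HQQ kernels; the group-wise scale/zero layout is simplified):

```python
import torch

def dequant_matmul(x, W_q, scale, zero):
    # x:     (batch, in_features) fp16 activations
    # W_q:   (out_features, in_features) low-bit integer weights
    # scale: group-wise scales, broadcastable to W_q
    # zero:  group-wise zero-points, broadcastable to W_q
    W = (W_q.to(x.dtype) - zero) * scale   # dequantize: affine reconstruction in fp16
    return x @ W.T                         # regular linear-layer matmul
```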

The thing to keep in mind is that the packages that do quantization (Auto-GPTQ, Quip#, etc.) don't use the exact same HF architecture; they do other things to make the model faster, like merging the QKV attention layers and the MLP layers. For example, you can see that here: https://github.com/PanQiWei/AutoGPTQ/blob/d2662b18bb91e1864b29e4e05862712382b8a076/auto_gptq/modeling/auto.py#L88C17-L88C17 So comparing something like Auto-GPTQ with the original HF model is not a fair comparison, but we are discussing potentially adding things like that via Unsloth in https://github.com/mobiusml/hqq/issues/8

I found this benchmark, and it seems like the exllamav2 kernel is pretty fast: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark

vince62s commented 8 months ago

Well, this is the whole point. exllamav2 is fast, but if you optimize AWQ as done here: https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#benchmarks you can go even faster with some CUDA kernels + a fast RMSNorm kernel; fusing does not help that much. So the question is: whatever the backend, what kind of inference tok/sec can you achieve today with Llama2-7B or Mistral?

mobicham commented 8 months ago

Sure, the architecture optimization work and the Triton/CUDA kernels are being discussed in https://github.com/mobiusml/hqq/issues/8 . Since many projects like Unsloth/VLLM/ExLlama/etc. have already done the work of optimizing the models, which we can re-use, we have been focusing mainly on the quantization algorithm. Speed is useless if the quantized model outputs garbage :)

Regarding your question, I answered it in the table above in https://github.com/mobiusml/hqq/issues/11#issuecomment-1881647120 . As of now, the inference time with the Hugging Face model implementation (no caching) and a sequence length of 1024 is about 0.150 sec on the A100 and 0.304 sec on the Titan RTX. With the VLLM backend, inference should be about 5x faster. Text generation should be faster when caching is used.

mobicham commented 8 months ago

I was able to use the AutoAWQ CUDA kernels with HQQ data, and they are actually slower than HQQ with torch.compile() on the Titan RTX (numbers below for batch_size=1 / seq_len=1024). I had a similar experience with AutoGPTQ: without layer fusion, it was slower than HQQ with torch.compile(). That means the speed-up is mainly coming from layer fusion plus the other kernels for layernorm/attention/etc.

| Backend                | Time (sec) |
|------------------------|------------|
| PYTORCH_COMPILE        | 0.304      |
| AutoAWQ - GEMV         | 0.499      |
| AutoAWQ - GEMM         | 0.572      |
| HQQLinearTritonSavable | 1.987      |

vince62s commented 8 months ago

I don't know exactly what you measure, but if my understanding is correct, you measure the "prefilling", which is reported in this table: https://github.com/casper-hansen/AutoAWQ?tab=readme-ov-file#benchmarks 0.5 sec for 1024 tokens would mean prefilling at ~2000 tok/sec, which is in line with the table above. However, you can see there is no direct correlation with the subsequent decoding speed in tok/sec, which is the most interesting number. I am not saying HQQ is slower, but it would be good to have a comprehensive benchmark.

mobicham commented 8 months ago

Sure, but that's measuring something different. Since this is a quantization package, not an inference engine, we are interested in measuring the forward pass without caching for the exact same model architecture: the main question is which dequantize() -> matmul is faster, not which model implementation/decoding logic is faster. Everything that is independent of the quantization logic can come later. I do agree that the current HF setup is slow, which is why we added VLLM, and it would be great to re-use stuff from AutoAWQ/Unsloth/etc., there's definitely value in that! Unsloth is particularly interesting because it also supports much faster training, so we are actively looking into that!
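
For reference, a minimal sketch of that kind of measurement (illustrative only, not the exact script behind the numbers above): a single forward pass at batch_size=1, seq_len=1024 with use_cache=False, timed with CUDA synchronization.

```python
import time
import torch

@torch.no_grad()
def time_forward(model, seq_len=1024, n_runs=10):
    # Dummy token ids; use_cache=False keeps the KV cache out of the measurement.
    x = torch.ones((1, seq_len), dtype=torch.int32, device='cuda')
    model(x, use_cache=False)            # warm-up (kernel compilation, allocator)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_runs):
        model(x, use_cache=False)
    torch.cuda.synchronize()
    return (time.time() - t0) / n_runs   # average seconds per forward pass
```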

vince62s commented 8 months ago

You can also have a look at this table: https://github.com/PanQiWei/AutoGPTQ/pull/484 and see that this is not as simple as testing one case (batch size 1 / seq_len 1024). Anyway, I'll give it a try and compare, everything else being kept comparable, apples to apples.

mobicham commented 8 months ago

Yes, the impact of the dequantization step should actually be lower as the batch size and seqlen increase: the dequantization cost is fixed per weight, while the matmul cost grows with the number of tokens. Also, the type of GPU can have an impact: some kernels are tuned for the A100, and once you use them on older GPUs they can run slower than vanilla Pytorch.

Sure, that means replacing the linear layers in the model without touching the rest. You can do this via HQQ patching functions (Llama + Mixtral supported for HF):

```python
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Load the fp16 model and tokenizer (model_id, hf_auth, cache_path are user-defined).
model     = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

# Swap a linear layer for its quantized version when a config is provided,
# otherwise keep the original layer untouched.
def patch_linear(linear_layer, quant_config):
    if quant_config is not None:
        new_layer = HQQLinear(linear_layer, quant_config)
    else:
        new_layer = linear_layer
    return new_layer

# Quantization settings for the linear layers (example: 4-bit, group-size 64).
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Apply the same config to every linear-layer tag of the architecture.
linear_tags  = model.base_class.get_linear_tags()
patch_params = {tag: quant_config for tag in linear_tags}

# Patch the model: non-linear modules go to fp16/CUDA, linear modules get quantized.
model.base_class.patch_model(model, lambda l: l.half().cuda(), patch_linear, patch_params, verbose=True)
```

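After patching, the model should behave like a regular transformers causal LM, so a quick sanity check could look like the sketch below (assuming the patched model exposes the usual `generate()` API; the prompt and generation settings are placeholders):

```python
import torch

# `model` and `tokenizer` come from the patching snippet above.
prompt = "Explain half-quadratic quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
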
mobicham commented 5 months ago

Closing this, since HQQ now runs with fused kernels: 7B models run at ~200 tokens/sec on a 4090.