turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Allow padding data instead of concatenating when generating calibration dataset #209

Open ivsanro1 opened 7 months ago

ivsanro1 commented 7 months ago

Hi,

I've seen that the code that generates the calibration data just takes a dataframe and concatenates all the rows. I was wondering whether this can affect calibration quality, since it misaligns datasets that are in instruction format:

<s> [INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n Instruction...

The instructions have variable length and are all concatenated, so the initial tokens (<s> [INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>) no longer sit at the beginning of each sequence.

Wouldn't this affect quantization, since the positional embeddings would be different from the ones seen at inference?
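For illustration, here is a rough sketch of what I mean (not the library's actual calibration code; the model path is a placeholder and the two samples are made up):

import torch
from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder
config.prepare()
tokenizer = ExLlamaV2Tokenizer(config)

sample_a = "<s> [INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n First instruction [/INST] ..."
sample_b = "<s> [INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n Second instruction [/INST] ..."

ids_a = tokenizer.encode(sample_a)       # shape (1, len_a)
ids_b = tokenizer.encode(sample_b)       # shape (1, len_b)
row = torch.cat([ids_a, ids_b], dim=-1)  # one concatenated calibration row

# The prefix of sample_b now starts at position len_a instead of position 0,
# so its tokens see different positional information than they would at inference.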

Because of this, I've also tried to pad the samples manually, but ExLlamaV2Tokenizer does not use the pad token I set on it anyway:

>>> from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer
>>> config = ExLlamaV2Config()
>>> config.model_dir = str(DIR_MODEL)
>>> config.qkv_embed = False
>>> config.prepare()
>>> tokenizer = ExLlamaV2Tokenizer(config)
>>> tokenizer.pad_token = '[PAD]'
>>> tokenizer.pad_token_id = 32000
>>> tokenizer.encode("[PAD]")
tensor([[  518, 29925,  3035, 29962]])
turboderp commented 6 days ago

Sorry for the delayed response.

Padding has to be done manually, as the padding token is not standardized across models. But I wouldn't recommend this approach anyway: since there's no masking during quantization, you'd only be calibrating the model to represent the embedding of the padding token accurately.
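For reference, "manual" padding would look something like this (a minimal sketch, not anything the quantizer provides; 32000 as the pad id is taken from your snippet above and 2048 as the row length is an arbitrary choice):

import torch

pad_id = 32000   # assumption: whatever pad id your model actually reserves
row_len = 2048   # assumption: arbitrary calibration row length

def pad_row(ids: torch.Tensor, length: int, pad_id: int) -> torch.Tensor:
    # Right-pad (or truncate) a (1, n) token id tensor to (1, length).
    n = ids.shape[-1]
    if n >= length:
        return ids[:, :length]
    pad = torch.full((1, length - n), pad_id, dtype=ids.dtype)
    return torch.cat([ids, pad], dim=-1)

# e.g. padded = pad_row(tokenizer.encode(sample), row_len, pad_id)
# Without masking, every pad position still feeds the calibration statistics.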

Rather, the input IDs would have to change from a single tensor to a list of variable-length tensors. And there's a lot more that would have to be taken into account: finding a good set of standard instruct samples, standardizing the instruct formatting, or providing some way to supply a template to the quantizer.
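Roughly, the data layout would be something like this (just a sketch, not the quantizer's API; instruct_samples is a hypothetical list of already-formatted strings, and the tokenizer is set up as in the snippet above):

# Tokenize each sample on its own so the template prefix always starts at position 0.
calibration_rows = [
    tokenizer.encode(text)[:, :2048]   # one (1, n) tensor per sample, n varies
    for text in instruct_samples       # hypothetical list of formatted instruct samples
]
# The quantizer would then iterate over this list instead of slicing fixed-length
# windows out of a single concatenated tensor.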

I'm not convinced this presents an issue in practice, though, so I can't really prioritize it with so many other things to look at.