turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Is there a way to make compress_pos_emb dynamic? #241

Closed fahadh4ilyas closed 11 months ago

fahadh4ilyas commented 11 months ago

So, I'm trying to make the values of sin and cos change based on the sequence length. I found that the sin and cos tensors of a model loaded with compress_pos_emb=2 relate to those of a model loaded with compress_pos_emb=1 (no compression) as:

model_with_compress_pos_emb_2.layers[0].self_attn.sin[:,:,::2,:] == model_with_compress_pos_emb_1.layers[0].self_attn.sin
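To illustrate that relation, here is a minimal sketch (not ExLlama's actual code; build_sin_cos and its arguments are made up for the example) of RoPE tables built with a position-compression scale, where the scale-2 table strided by 2 reproduces the scale-1 table:

```python
import torch

def build_sin_cos(max_seq_len, head_dim, compress_pos_emb=1, base=10000.0):
    # Standard RoPE inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Positions are divided by the compression factor
    positions = torch.arange(max_seq_len, dtype=torch.float32) / compress_pos_emb
    freqs = torch.outer(positions, inv_freq)   # (max_seq_len, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)    # (max_seq_len, head_dim)
    return emb.sin()[None, None, :, :], emb.cos()[None, None, :, :]

sin1, cos1 = build_sin_cos(2048, 128, compress_pos_emb=1)
sin2, cos2 = build_sin_cos(4096, 128, compress_pos_emb=2)
# Every second position of the scale-2 table matches the scale-1 table
assert torch.allclose(sin2[:, :, ::2, :], sin1)
assert torch.allclose(cos2[:, :, ::2, :], cos1)
```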

So, for a model with compress_pos_emb set to 2, I'm trying to add a step parameter such that:

  1. when input_ids.shape[-1] <= 2048, I set step=2.
  2. when 2048 < input_ids.shape[-1] <= 4096, I set step=1.

so that I could do this:

cuda_ext.exllama_ext.rope_(query_states, self.sin[:,:,::step,:], self.cos[:,:,::step,:], past_len, self.config.num_attention_heads, self.config.head_dim)

But computing perplexity shows that nothing has changed. Is there a step missing here?

EDIT: I realize that the RoPE computation is not done in Python; the CUDA extension only receives the data pointers of the sin and cos tensors. That's why slicing self.sin[:,:,::step,:] has no effect: the pointer still references the whole tensor. My current workaround is to precompute multiple sin and cos tensors, one per step, at the cost of extra VRAM.
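This is easy to confirm from Python: a strided slice is just a view, so a kernel that only gets the raw pointer and assumes contiguous data still sees the original layout (a minimal sketch, not part of ExLlama):

```python
import torch

sin = torch.randn(1, 1, 4096, 128)

view = sin[:, :, ::2, :]                   # strided view, not a copy
print(view.is_contiguous())                # False
print(view.data_ptr() == sin.data_ptr())   # True: same underlying storage

# A kernel that only receives data_ptr() and assumes contiguous rows will
# therefore read the full, unstrided table. The workaround is to materialize
# a separate contiguous copy per step (at the cost of VRAM):
sin_step2 = sin[:, :, ::2, :].contiguous()
print(sin_step2.data_ptr() == sin.data_ptr())  # False: independent storage
```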

turboderp commented 11 months ago

Yes, the sin and cos tensors are precomputed in ExLlama.__init__(), with the scale given by config.compress_pos_emb. If you want to use multiple scales you'd have to either modify the CUDA functions that apply the embeddings or create multiple versions of those tensors, e.g. at load time.
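A minimal sketch of the second option, reusing the hypothetical build_sin_cos helper sketched in the first comment above (none of these names are ExLlama's actual API):

```python
# Hypothetical: precompute one (sin, cos) pair per scale at load time,
# then pick the pair matching the active scale before calling the RoPE kernel.
ROPE_SCALES = (1, 2)

def build_rope_tables(max_seq_len, head_dim):
    return {
        scale: build_sin_cos(max_seq_len * scale, head_dim, compress_pos_emb=scale)
        for scale in ROPE_SCALES
    }

# At inference time (sketch):
#   sin, cos = rope_tables[active_scale]
#   cuda_ext.exllama_ext.rope_(query_states, sin, cos, past_len,
#                              config.num_attention_heads, config.head_dim)
```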

But keep in mind that keys and values computed and cached for one scale will be invalid for any other, so if you're hoping to use one scale up to 2048 tokens and then switch to another as the generation grows longer, this won't work. You'll have to drop the cache at that point and run inference on the sequence-so-far to build the cache for the new scale.
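Concretely, switching would have to look something like this hedged sketch (a hypothetical helper, not ExLlama's API; it assumes the model has already been pointed at the sin/cos tables for the new scale and that a prefill-only forward pass is available):

```python
def switch_rope_scale(model, cache, sequence_ids):
    # The cached keys/values were rotated with the old scale's sin/cos tables,
    # so they are invalid under the new scale: drop the cache entirely.
    cache.current_seq_len = 0
    # Re-run the whole sequence-so-far so the cache is rebuilt with the new
    # scale's embeddings before generation continues (prefill only, no sampling).
    model.forward(sequence_ids, cache, preprocess_only=True)
```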

Moreover, the scale you should use is the one the model is finetuned on. So it's doubtful that you'll get good results with this approach in any case.

fahadh4ilyas commented 11 months ago

Yeah, you are right. I thought we could just use scaling without fine-tuning, but the generated results are not good even though the perplexity is good and decreasing.