turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

CodeLlama support #270


ParisNeo commented 1 year ago

Hi, CodeLlama is not really working on exllama. The answers are sometimes complete gibberish.

Can you please update the library to support the new rope_theta parameter of CodeLlama?

Best regards

turboderp commented 1 year ago

I've added this with the latest commit. I haven't thoroughly tested it, but it's a small change, just reading the value from the config and applying it. Let me know if there are any issues with it.

Ph0rk0z commented 1 year ago

What will it do with how it is in textgen now?

        if shared.args.alpha_value > 1 or shared.args.rope_freq_base > 0:
            # derive the effective alpha from either the user-set alpha or an explicit frequency base
            config.alpha_value = RoPE.get_alpha_value(shared.args.alpha_value, shared.args.rope_freq_base)
            config.calculate_rotary_embedding_base()

Will theta be overwritten or will this param get ignored?

turboderp commented 1 year ago

It will calculate the rotary embedding base based on the rope_theta specified in the config file and the supplied alpha value:

    # read theta from the model's config.json, falling back to Llama's default of 10,000
    self.rotary_embedding_base = read_config["rope_theta"] if "rope_theta" in read_config else 10000.0

    ...

    def calculate_rotary_embedding_base(self):
        # NTK scaling: base = theta * alpha ** (head_dim / (head_dim - 2))
        self.rotary_embedding_base = self.rotary_embedding_base * self.alpha_value ** (self.head_dim / (self.head_dim - 2))

So if you were using an alpha value of 93 to get the correct theta for CodeLlama (i.e. 1,000,000), now you would use an alpha value of 1 instead for the same result, as long as the config file specifies the correct theta for the model.

If you call calculate_rotary_embedding_base with an alpha value of 93 after the config has already been initialized with the correct theta, the two would compound and you'd get a base of around 100M.
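To make the numbers concrete, here's a quick back-of-the-envelope check of both cases (plain Python; head_dim = 128 for Llama):

    head_dim = 128

    # legacy workaround: alpha = 93 on top of Llama's default theta of 10,000
    print(10000.0 * 93 ** (head_dim / (head_dim - 2)))      # ~1,000,000, CodeLlama's base

    # compounding: the same alpha applied after theta is already 1,000,000
    print(1000000.0 * 93 ** (head_dim / (head_dim - 2)))    # ~100,000,000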

Ph0rk0z commented 1 year ago

So with alpha, it will need to be changed to reset the base to 10000.0 and then apply the alpha on top. And if theta is specified, just apply it directly.

turboderp commented 1 year ago

No, only if you want to use an incorrect base. The right base for Llama is 10,000 and for CodeLlama it's 1,000,000. The NTK alpha value is relative to that, and the only reason people have been using alpha = 93 is to get the above calculation to yield a base of 1,000,000, which is now read directly from the config instead.

If you want to use an alpha value of, say, 2.4, that now means the same thing for CodeLlama as it does for Llama or Llama2 (i.e. scale the RoPE frequency base by 2.4^(128/126)).
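A quick illustration of "means the same thing" (a sketch; head_dim = 128 in both models):

    head_dim = 128
    scale = 2.4 ** (head_dim / (head_dim - 2))   # ~2.43, independent of the model
    print(10000.0 * scale)      # Llama / Llama2: base ~24,300
    print(1000000.0 * scale)    # CodeLlama:      base ~2,430,000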

Ph0rk0z commented 1 year ago

Right, but I ran perplexity tests on CodeLlama 34B + Airoboros 2.1 LoRA, and the numbers are lower by more than a whole point with an alpha of ~2.7 than with the "correct" base. The divide grew further at longer context. Try it yourself if you have them. So for this one case, at least, it's not good to read the config and auto-apply the original theta.

I'm sure this will be true for other tunes off that model in the future depending on what people do when training. The way this is now, I will be forced to use the 1 million base.

I haven't tested what happens with Phind, WizardCoder, or when the LoRA is merged. That's still TBD. I read posts where people were having trouble replicating Phind's results, and my gut says this is related.

turboderp commented 1 year ago

I'm not sure what the best approach is, then. rope_theta is an extension of the model spec, and it seems best to emulate the behavior of the HF LlamaModel, which defines the "correct" frequency base to be 1M in this case. If you're getting better results running the model "out of spec" like this, you can still do that by specifying theta manually, or, if you want to influence it just with the alpha parameter, the two are related by the formula above. E.g.:

Theta = 10,000 and alpha = 2.7 would give base = 10,000 * 2.7 ^ (128/126) = 27,429

To get a base of 27,429 with theta = 1,000,000, you'd want alpha = (27,429 / 1,000,000) ^ (126/128) = 0.029
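The same conversion in code (just the formula above; nothing exllama-specific):

    head_dim = 128
    old_base = 10000.0 * 2.7 ** (head_dim / (head_dim - 2))             # ~27,429
    new_alpha = (old_base / 1000000.0) ** ((head_dim - 2) / head_dim)   # ~0.029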

Ph0rk0z commented 1 year ago

I'm not either. That's a lot of gymnastics, but most people usually want to run at spec. It's also probably why we aren't seeing many fine-tunes of the 34B: people think the model is shit because they used the wrong settings.

Tested Samantha 1.1 with PTB_NEW. It's GGML, but same difference.

At 1024 context: base 1e6 = 44.x PPL, base 27,429 = 43.x PPL.

Wish I'd downloaded the GPTQ version instead. I bet there's a more ideal rope base.

ChrisCates commented 1 year ago

@Ph0rk0z, @turboderp, Hey guys, just bumping this issue.

I brought up the discussion about RoPE base and RoPE frequency in a previous issue. I'm not sure if it's possible to integrate that type of technique into exllama.

Also, creating exllama generators is pretty fast, so we could also create a simple regression-testing script that measures PPL across a list of alpha and theta values to find the ideal combinations (sketched below).
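A minimal sketch of what that sweep could look like (compute_ppl here is a hypothetical placeholder for whatever perplexity routine gets hooked up, not an existing exllama function):

    import itertools

    def compute_ppl(rotary_embedding_base: float) -> float:
        # hypothetical hook: evaluate the model at this base and return its perplexity
        raise NotImplementedError

    thetas = [10000.0, 1000000.0]
    alphas = [1.0, 1.5, 2.0, 2.7, 4.0]

    results = []
    for theta, alpha in itertools.product(thetas, alphas):
        base = theta * alpha ** (128 / 126)   # same scaling formula as above
        results.append((theta, alpha, base, compute_ppl(base)))

    # report from best (lowest) to worst PPL
    for theta, alpha, base, ppl in sorted(results, key=lambda r: r[-1]):
        print(f"theta={theta:.0f}  alpha={alpha}  base={base:.0f}  ppl={ppl:.2f}")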

turboderp commented 1 year ago

The position embeddings aren't really a bottleneck, so it's possible other positional encoding schemes could be just as fast. But I'm always hesitant to jump on any newly proposed format or technique, because it's literally impossible to keep up.

Most of them don't hold up, as it turns out, and even when they're sound they don't always catch on, which leads to dead code that can potentially cause problems down the line. So it's not so much a question of performance, but more about how many different features I can realistically implement and maintain, and which ones are actually worth allocating time for.

ChrisCates commented 1 year ago

@turboderp understood. Here is my proposal.

Instead of replacing the current rotary embedding calculation, we keep optionality for two: rope_alpha and rope_theta drive the first calculation, and rope_base and rope_frequency drive the second. We should rename the --alpha flag to --rope-alpha for extra clarity, and use flags like --use-rope-alpha and --use-rope-base to select the calculation type (a rough sketch follows below).
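A rough sketch of that flag surface (the flag names here are this proposal, not existing exllama options):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--rope-alpha", type=float, default=1.0)
    parser.add_argument("--rope-theta", type=float, default=10000.0)
    parser.add_argument("--rope-base", type=float)
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--use-rope-alpha", action="store_true")  # base = theta * alpha^(d/(d-2))
    mode.add_argument("--use-rope-base", action="store_true")   # apply the given base directly

    args = parser.parse_args()
    if args.use_rope_base and args.rope_base is not None:
        base = args.rope_base
    else:
        base = args.rope_theta * args.rope_alpha ** (128 / 126)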

Second, let's just pull the RoPE calculations from the llama.cpp repo. This will be easier and faster, and given how rotary embeddings work, it shouldn't be a problem.
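For reference, the core rotation itself is small. A minimal NumPy sketch of the standard rotary embedding with interleaved pairs, as in the original RoPE paper (illustrative only, not llama.cpp's actual code); the frequency base discussed throughout this thread is the base argument here:

    import numpy as np

    def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
        # x: (seq_len, head_dim) activations for one attention head
        seq_len, head_dim = x.shape
        # per-pair frequencies: base^(-2i / head_dim), i = 0 .. head_dim/2 - 1
        inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
        angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, head_dim/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]        # interleaved even/odd pairs
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin     # rotate each 2-d pair by its angle
        out[:, 1::2] = x1 * sin + x2 * cos
        return out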

Third, while not strictly necessary, an additional testing script for PPL, and maybe one that reviews sample outputs, would be nice, just to see what the optimal alpha, theta, or base and frequency values are. This is up for discussion and should be a separate PR.

I'd be happy to formalize this into a spec now. In terms of implementation, I'll take a deep dive in a couple of weeks, assuming no one else is working on it.