henk717 / KoboldAI

KoboldAI is generative AI software optimized for fictional use, but capable of much more!
http://koboldai.com
GNU Affero General Public License v3.0

support for trust_remote_code / 8k context #410

Open BlairSadewitz opened 1 year ago

BlairSadewitz commented 1 year ago

Hello,

There are a number of models I'd like to try which require this. I know that I asked you about this in the past, and IIRC you mentioned that you removed it because you wanted to implement it properly. In the interim, would you kindly tell me what I have to change in order to pass this flag to the appropriate call(s)? You don't have to cover every conceivable situation or model type, just hf or hf_torch or whichever backend is needed to load, e.g., llama-based models in 16-bit (don't worry about 8-bit or 4-bit loading), maybe falcon, etc. I'd just as happily patch transformers itself; whatever gets it to work. I'm mostly trying to load the models with increased context size.

Thanks.

BlairSadewitz commented 1 year ago

Being able to use a monkey patch would be cool, too, but I assume that's even more work.

BlairSadewitz commented 1 year ago

What I am most interested in is being able to use models which use this:

https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch-16k.py

Most of them are 8k.

https://huggingface.co/TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-fp16/tree/main

henk717 commented 1 year ago

This is planned as a separate addon but is currently unfinished.

BlairSadewitz commented 1 year ago

Oh, OK, fair enough. Whenever you have a spare moment, would you kindly tell me where in the code the call that loads a 16-bit llama-based model (you know, one I'd download from HF) lives, so I could just rig it myself to work? Whenever I have the time, I'll figure out how to use Python to just tell me the line number. If that happens before you get around to replying to this, I'll close out the issue. It could be either the code in KoboldAI or the code in transformers itself; I don't care which.

henk717 commented 1 year ago

The easiest way to do it is with our Basic HF backend, since there it will be in the from_pretrained lines; in the main backend it's quite complicated. The hold-up is that the Basic HF backend is unfinished and unstable, so your mileage may vary considerably.
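Roughly speaking, the change amounts to something like this in plain transformers (the model path and variable names below are purely illustrative, not the actual backend code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model path; any repo that ships custom modeling code needs the flag.
model_path = "TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",       # load in 16-bit where the checkpoint allows it
    trust_remote_code=True,   # allow the repo's custom code to run
)
```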

BlairSadewitz commented 1 year ago

Hmm, yeah, I'm having some issues with it. :(

Check this out, though: RoPE scaling got merged into transformers. Models don't have to be pretrained with it to use it, though apparently you lose some accuracy if they aren't. Maybe you'd want to add support for this at some point? It works for GPT-NeoX too, according to the chatter online.

https://github.com/huggingface/transformers/commit/34d94094279d2c903d9d8a51a65edb265f22c849#diff-9ba75cc28be7924a2fc43de1d2c8c7779ad597129d33d1af39153951463cd0bc

Also, there's this:

https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

The patch is three lines. That code mitigates the increase in perplexity. Here's a colab:

https://colab.research.google.com/drive/1VI2nhlyKvd5cw4-zHvAIk00cAVj2lCCC#scrollTo=b80b3f37
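For reference, the patch boils down to overriding how LlamaRotaryEmbedding computes its rotary base. This is just a sketch based on the linked colab; the alpha of 8 and the 16k context are example values, not recommendations:

```python
import transformers.models.llama.modeling_llama as llama

old_init = llama.LlamaRotaryEmbedding.__init__

def ntk_scaled_init(self, dim, max_position_embeddings=2048, base=10000, device=None):
    # NTK-aware scaling: stretch the target context and rescale the rotary base.
    max_position_embeddings = 16384   # target context length (example value)
    alpha = 8                         # scaling factor (example value)
    base = base * alpha ** (dim / (dim - 2))
    old_init(self, dim, max_position_embeddings, base, device)

llama.LlamaRotaryEmbedding.__init__ = ntk_scaled_init
```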

BlairSadewitz commented 1 year ago

I just noticed everything you merged. Thanks! I'd been hopping between forks, and this makes my life a lot easier.

BlairSadewitz commented 1 year ago

In case you aren't aware, transformers now has support for rope scaling.

https://huggingface.co/docs/transformers/main/model_doc/llama#transformers.LlamaConfig
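Setting it manually with plain transformers looks roughly like this (the model name and scaling factor are just examples; SuperHOT 8K checkpoints were tuned for 4x the original 2048 context, hence factor 4.0):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "TheBloke/airoboros-33B-gpt4-1-4-SuperHOT-8K-fp16"

# Override the RoPE scaling settings on the config before loading the weights.
config = AutoConfig.from_pretrained(model_path)
config.rope_scaling = {"type": "linear", "factor": 4.0}

model = AutoModelForCausalLM.from_pretrained(model_path, config=config, torch_dtype="auto")
```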

henk717 commented 1 year ago

We automatically use rope scaling if it's present in a model's config. Manual control for it is planned.

BlairSadewitz commented 1 year ago

Ooh, nice. That makes my life a lot easier.

Incidentally, I stumbled upon this:

https://github.com/jquesnelle/scaled-rope

Basically, it builds a wheel with the necessary code to support all these different scaling methods along with patch functions, e.g.

```python
def patch_llama_for_linear_scaled_rotary_embeddings(model, scale):
    from .LlamaLinearScaledRotaryEmbedding import LlamaLinearScaledRotaryEmbedding
    # Swap out the rotary embedding on every attention layer for a scaled one.
    for each in model.model.layers:
        each.self_attn.rotary_emb = LlamaLinearScaledRotaryEmbedding(
            each.self_attn.head_dim,
            scale=scale,
            device=each.self_attn.rotary_emb.inv_freq.device,
        )
```

I found it because I was having trouble loading some models due to how their layers are set up, and it takes care of that.
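A rough usage sketch (the import path is a guess based on the repo layout, so treat it as hypothetical):

```python
from transformers import AutoModelForCausalLM
# Hypothetical import path; check the scaled-rope repo for the actual module name.
from scaled_rope.patch import patch_llama_for_linear_scaled_rotary_embeddings

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype="auto")
# Scale the rotary embeddings 4x, e.g. for an 8K-tuned llama checkpoint.
patch_llama_for_linear_scaled_rotary_embeddings(model, scale=4.0)
```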