turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Lora support #55

Open · alain40 opened this issue 1 year ago

alain40 commented 1 year ago

Congrats and thank you again for a project that changes everything. Can't use anything else and now I even prefer your Web UI to the std. text-web-ui...

In some instances it would be super useful to be able to load separate LoRAs on top of a GPTQ model loaded with exllama.

turboderp commented 1 year ago

I'm going to be looking at LoRAs soon, probably over the weekend. Are there any particular adapters on HF you're interested in, just so I have some reference points?

jmoney7823956789378 commented 1 year ago

Personally, I'm trying to get my own trained one up on exllama. Unfortunately I haven't even been able to confirm that it loads correctly.

turboderp commented 1 year ago

If you merge the LoRA with the original model, convert that to GPTQ and load it in ExLlama, it should be loading correctly. As for loading the LoRA separately, support for that is still pending. But what are you using to train it?
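
For reference, the merge step with Hugging Face's PEFT library looks roughly like this (a minimal sketch; the model name and adapter path are placeholders, and the merged folder still has to be quantized with your GPTQ tool of choice afterwards):

```python
# Minimal sketch: fold a PEFT LoRA into the base weights before GPTQ conversion.
# "huggyllama/llama-7b" and the adapter path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()            # bakes the LoRA deltas into the base weights
merged.save_pretrained("llama-7b-merged")    # quantize this folder with GPTQ afterwards

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.save_pretrained("llama-7b-merged")
```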

jmoney7823956789378 commented 1 year ago

I tried a small finetune on a separate machine with an RTX 2080S (I'm considering using RunPod later). I used the monkey-patch with a small 7B 4-bit model to finetune on a dataset from a certain cyber forensics course (which I cannot name at this time). It loads fine afterwards, but in my testing I'm unable to confirm whether the LoRA is actually adding any content knowledge.

turboderp commented 1 year ago

I don't need to know about the dataset, but there are a bunch of different approaches to training LoRAs, lots of repos that use slightly different methods, adapting different layers etc. Not just one monkey patch. E.g. the original Stanford Alpaca paper trained adapters for the K and V projections, but QLoRA I think defaults to all linear layers, and uses its own quantized format. Then there's GPTQ-LoRA now. And probably a million things in between.
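
To illustrate, in PEFT terms the difference between those approaches is mostly just which module names you list in the adapter config (a sketch; the ranks and alpha values are arbitrary, and the module names are the standard HF Llama projection names):

```python
# Sketch: the adapted layers are just a config choice in PEFT's LoraConfig.
from peft import LoraConfig

attn_only = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                       target_modules=["q_proj", "v_proj"])

all_linear = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                        "gate_proj", "up_proj", "down_proj"])
```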

As for whether the LoRA is doing anything, it can be hard to say if you don't have a clear sense of what it should be doing. Alpaca, for instance, trains to complete a prompt in a particular format, but the base model can sometimes decipher that format as well, just "not as well". You could run a perplexity test on the training dataset with the base model and the adapted model, respectively, but that's still no guarantee that you'll get the results you were hoping for. Choosing the right hyperparameters for training and constructing a good finetuning dataset in the first place is more art than science.
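
A perplexity comparison of that kind could be sketched like this (assumptions: HF-style model and tokenizer objects and a list of held-out text chunks; all names are placeholders):

```python
# Rough sketch of a base-vs-adapted perplexity comparison on a set of texts.
import math
import torch

def perplexity(model, tokenizer, texts, max_len=2048):
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=max_len).input_ids.to(model.device)
            out = model(ids, labels=ids)      # causal LM loss = mean NLL per token
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Lower is better; compare e.g.
# perplexity(base_model, tokenizer, eval_texts) vs. perplexity(lora_model, tokenizer, eval_texts)
```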

jmoney7823956789378 commented 1 year ago

Ah, I gotcha. I trained solely within the Ooba UI, using a plaintext dataset of approximately 33K lines, mostly transcripts from audio lectures (about cyber forensics). As I am a cyber tard and not a code tard, I am not familiar with all the different LoRA training methods. I also don't get paid enough to manually sift through it all and turn it into Alpaca-formatted JSON ;(

turboderp commented 1 year ago

It's not really the format that matters for supporting the LoRA, just what layers are targeted by adapters and the datatype they're stored as. But I guess Ooba does have a built-in LoRA training feature, so I can probably get the details I need from there to work out a starting point for it.

As for training data formats, Alpaca is just one format that some researchers got decent results with at one point. You can format the examples any way you like, as long as it represents a structure you can later use to make predictions about new data. And usually you'll want to script it somehow, especially because you want to be able to reformat the data if the format you chose isn't working out.
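
As a toy example of such a script, a hypothetical helper could chunk a plaintext transcript into Alpaca-style records (the instruction text, chunk size and paths below are arbitrary placeholders, not recommendations):

```python
# Hypothetical helper: turn a plaintext transcript into Alpaca-style JSON records.
import json

def transcript_to_alpaca(in_path, out_path, lines_per_chunk=20):
    with open(in_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    records = []
    for i in range(0, len(lines), lines_per_chunk):
        records.append({
            "instruction": "Summarize the following lecture excerpt.",  # placeholder task
            "input": " ".join(lines[i:i + lines_per_chunk]),
            "output": "",  # to be filled in manually or by another model
        })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

# transcript_to_alpaca("lectures.txt", "lectures_alpaca.json")
```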

alain40 commented 1 year ago

Which LoRA? I use text-web-ui because of its convenience (I think this is common; warning: small sample size).

text-web-ui uses Hugging Face's peft library. That library provides the following default target layers:

TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING = {
    "t5": ["q", "v"],
    "mt5": ["q", "v"],
    "bart": ["q_proj", "v_proj"],
    "gpt2": ["c_attn"],
    "bloom": ["query_key_value"],
    "blip-2": ["q", "v", "q_proj", "v_proj"],
    "opt": ["q_proj", "v_proj"],
    "gptj": ["q_proj", "v_proj"],
    "gpt_neox": ["query_key_value"],
    "gpt_neo": ["q_proj", "v_proj"],
    "bert": ["query", "value"],
    "roberta": ["query", "value"],
    "xlm-roberta": ["query", "value"],
    "electra": ["query", "value"],
    "deberta-v2": ["query_proj", "value_proj"],
    "deberta": ["in_proj"],
    "layoutlm": ["query", "value"],
    "llama": ["q_proj", "v_proj"],
    "chatglm": ["query_key_value"],
    "gpt_bigcode": ["c_attn"],
    "mpt": ["Wqkv"],
}

So the same as the original LoRA paper for llama, I think.

text-web-ui also supports other paths using qlora (bitsandbytes) and gptqlora (autogptq). I have no experience with them, but it's easy enough to check their source to figure out which layers they do or don't freeze.
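
For what it's worth, a saved PEFT adapter also records which layers it targets in its adapter_config.json, so an existing adapter can be inspected directly (the path is a placeholder):

```python
# Check which modules a saved PEFT adapter targets.
import json

with open("path/to/lora-adapter/adapter_config.json") as f:
    cfg = json.load(f)

print(cfg.get("target_modules"))            # e.g. ["q_proj", "v_proj"]
print(cfg.get("r"), cfg.get("lora_alpha"))  # rank and scaling used for training
```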

alain40 commented 1 year ago

LoRA use case: my use cases involve training on personal or work text documents. With training, the LLM gives more coherent answers than just searching a vector DB for embedding similarity.

These documents change often, which makes the convenience of dynamic LoRA loading valuable. Hence my initial question.

I think this is a common use case (same warning: small sample size). I'm not using any Hugging Face LoRAs. For the base LLM, it's better to just use the merged version; base models don't change often or fast enough for dynamic loading of LoRAs to make sense there.

turboderp commented 1 year ago

"llama": ["q_proj", "v_proj"],

Okay, so Q and V, that's what I was counting on. It should be simple enough.

> Models do not change often/fast enough that dynamic loading of LoRA makes sense.

I disagree, though. The ability to swap LoRAs in and out gives you the ability to use a model in multiple different modes without having to keep multiple versions of it in VRAM. That's extremely useful. Not that you'd want Alpaca one moment and Vicuna the next, but maybe you'd want to be able to switch to a summary mode, or a sentiment analysis mode, or chain-of-thought, or whatever.

alain40 commented 1 year ago

No question about it, there are use cases where changing LLM "mode" by dynamically loading LoRAs would be very valuable.

nivibilla commented 1 year ago

This is exactly what I was doing too, but just in normal HF: having a database of LoRA adapters for different tasks, effectively a mixture of experts, and then having another model choose the best one for the query. It's quite slow, however; it takes about 3 seconds to load a LoRA. Interested to hear your experience @turboderp

turboderp commented 1 year ago

Well, LoRA support in ExLlama is still kind of experimental. It needs more testing and validation before I'd trust it. But it does seem to be working. And loading a LoRA is extremely quick. It takes some milliseconds to load the 20-100 MB of tensors from a fast SSD, if you don't just keep a bunch of them in memory at the same time. Applying a LoRA is "free", as in it's just an optional argument to the model's forward() function.

It does take a little extra computation, though.
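
For anyone looking for a concrete starting point, loading and attaching an adapter looks roughly like the repo's LoRA example script. This is a sketch only: the class and argument names below are assumed from that example and may differ between versions, and all paths are placeholders.

```python
# Sketch only -- names assumed from exllama's example scripts; may differ by version.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
from lora import ExLlamaLora

config = ExLlamaConfig("models/llama-gptq/config.json")           # placeholder paths
config.model_path = "models/llama-gptq/model.safetensors"
model = ExLlama(config)
tokenizer = ExLlamaTokenizer("models/llama-gptq/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

# Load the adapter and attach it; swapping LoRAs is just replacing this object.
lora = ExLlamaLora(model, "loras/my-adapter/adapter_config.json",
                   "loras/my-adapter/adapter_model.bin")
generator.lora = lora   # the LoRA is then passed along to model.forward()
```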

nivibilla commented 1 year ago

I see, that's great news. The only thing left is batch processing and this project can be scaled up. And I know you said somewhere you're working on it. I don't know C++, but if you need help with benchmarking I'd be happy to help.

fraferra commented 1 year ago

Thank you so much for all the work you did on LoRA support @turboderp! Is LoRA stacking part of the roadmap as well?

jmoney7823956789378 commented 1 year ago

I'm also thankful for your efforts on this. Just retrained a 33B LoRA (had to rent compute since split-GPU training was buggy) and it seems to be working somewhat.

(screenshot attached)

I do wish I could feed paragraphs at a time into another model and just have it spit out properly formatted datasets for training... but maybe in another two weeks.

turboderp commented 1 year ago

@fraferra I'm going to look into it, but I'm a little cautious because there's a bit of a performance hit even for a single LoRA.

krzysiekpodk commented 11 months ago

Hey @turboderp, would you be so kind as to give an example of how to run a model with a LoRA? It would mean the world to me, as I've wasted 3 days and had to reinstall Ubuntu after all that driver testing. :D I'm trying to use CodeLlama 34B GPTQ and an Airoboros adapter, but no luck :(

I tested the example Python script, tested Docker, CUDA 11.8 and PyTorch 2.1.0, and their combinations.

I either get an illegal memory access, NaN errors, sign errors, etc.