johnsmith0031 / alpaca_lora_4bit


Merging changes upstream? #13

Open wywywywy opened 1 year ago

wywywywy commented 1 year ago

Now that this proof-of-concept seems to be functional, is it about time to think about merging the changes upstream to the respective repos?

text-generation-webui and GPTQ shouldn't be a problem because their maintainers are very responsive and very open to new ideas & contributions. Not sure about peft, though, as there's a company behind it.

johnsmith0031 commented 1 year ago

Yes, I may make some pull requests in the next several days.

Ph0rk0z commented 1 year ago

Upstream we will be stuck re-quantizing and using group size.

oobabooga commented 1 year ago

A PR to https://github.com/huggingface/peft and https://github.com/qwopqwop200/GPTQ-for-LLaMa would be awesome.

@Ph0rk0z re-quantization is being worked on here: https://github.com/oobabooga/text-generation-webui/pull/530

If all goes right, the models created by @USBhost will be easier and faster to load and also more accurate than the current decapoda weights.

johnsmith0031 commented 1 year ago

Does anyone have the requantized 4-bit llama model weights? I'd like to test the code on those first.

wywywywy commented 1 year ago

Does anyone have the requantized 4-bit llama model weights? I'd like to test the code on those first.

As @oobabooga has mentioned, @USBhost is in the process of cooking them right now!

Can't wait 😋

Ph0rk0z commented 1 year ago

I tried the https://huggingface.co/ozcur/alpaca-native-4bit weights.

ehartford commented 1 year ago

Oh, so these ones are not requantized: https://huggingface.co/maderix/llama-65b-4bit/tree/main

oobabooga commented 1 year ago

The new quantized weights by @USBhost are available: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-2-get-the-pre-converted-weights

This is how to use them:

python server.py --model llama-7b-4bit --wbits 4 
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128 

See the updated documentation here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode

sterlind commented 1 year ago

FYI, I've submitted a series of PRs to get these changes moving upstream: https://github.com/johnsmith0031/alpaca_lora_4bit/pull/23

mcmonkey4eva commented 1 year ago

Any updates towards getting the patches merged upstream? There's quite a lot of community interest in getting 4bit training support generally available.

I'd love to get 4-bit LoRA training working in the webui (the trainer interface is currently limited to 8-bit), but I'm currently blocked by not wanting to integrate a pile of hack/patch scripts into the webui just to force that to work.

Ph0rk0z commented 1 year ago

I mean it works how I have it. Not in some monkey patch: https://github.com/Ph0rk0z/text-generation-webui-testing/tree/DualModel

The problem of having to use sterlind's GPTQ remains though. I was going to cheat and edit the lora tab to get this feature unofficially.

mcmonkey4eva commented 1 year ago

Ooo, that's looking a lot closer. Ideally this should be turned into upstream PRs: @Curlypla's peft fork PR'd to main peft, and your GPTQ fork merged into GPTQ; then the webui changes can be merged into the webui, and, boom, a clean implementation of 4-bit LoRA for everyone :D If that's working properly, then all the code work is done and only filing the PRs remains.

Ph0rk0z commented 1 year ago

I know, but nobody can agree; that's why I did this. Plus, GPTQ keeps changing the model spec and putting in breaking changes that leave some cards unsupported or hurt performance.

oobabooga commented 1 year ago

Can I just edit the requirements.txt to use a custom fork of PEFT, then edit the code in https://github.com/oobabooga/GPTQ-for-LLaMa, and call it a day for using LoRAs on top of 4-bit models? People are uploading endless quantizations of models merged with LoRAs to Hugging Face, and that's wasteful when the LoRA itself is like a 20mb file.
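
In practice that would mean requirements.txt lines along these lines (a rough sketch only; the fork name and commit pins below are placeholders, not a recommendation):

git+https://github.com/<peft-fork>/peft.git@<pinned-commit>
git+https://github.com/oobabooga/GPTQ-for-LLaMa.git@<pinned-commit>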

wywywywy commented 1 year ago

Can I just edit the requirements.txt to use a custom fork of PEFT, then edit the code in https://github.com/oobabooga/GPTQ-for-LLaMa, and call it a day for using LoRAs on top of 4-bit models? People are uploading endless quantizations of models merged with LoRAs to Hugging Face, and that's wasteful when the LoRA itself is like a 20mb file.

Desperate times call for desperate measures. It's a yes from me!

mcmonkey4eva commented 1 year ago

The peft change is small and straightforward and really should just be PR'd without issue. The GPTQ changes should at least be attempted upstream. But, considering the peft fork author hasn't replied and the GPTQ fork author did reply above with a refusal to try... yeah, applying the hacky forks for now seems like the best option until people get around to making PRs.

Ph0rk0z commented 1 year ago

There is already a PEFT repo, no need to fork it. It's the GPTQ that has all the changes and multiple ways to go.

oobabooga commented 1 year ago

I have tried simply doing

pip uninstall -y peft
pip install git+https://github.com/huggingface/peft.git@70af02a2bca5a63921790036b2c9430edf4037e2
pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit

and then trying to start the web UI with

python server.py --model llama-7b-4bit-128g --lora tloen_alpaca-lora-7b

and that yielded the old

ValueError: Target module QuantLinear() is not supported. Currently, only torch.nn.Linear and Conv1D are supported.

error.

So I guess changing the requirements is not enough, and I also need the monkey patch code, @Ph0rk0z: https://github.com/johnsmith0031/alpaca_lora_4bit/blob/main/text-generation-webui/custom_monkey_patch.py

which itself depends on the remaining files in this repository. What is the minimal way of getting this working?

For reference, these are the commands to clean up and go back to the previous requirements:

pip uninstall peft gptq-llama-0.2
pip install -r requirements.txt --upgrade
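
The ValueError above is peft refusing the GPTQ QuantLinear layers, since its LoRA injection only accepts torch.nn.Linear and Conv1D targets. As a rough, self-contained sketch (toy code with hypothetical names, not this repository's actual implementation), this is why a LoRA adapter can still sit on top of a frozen quantized layer; it only needs the layer's forward output:

import torch.nn as nn

class Lora4bitWrapper(nn.Module):
    """Trainable low-rank adapter around a frozen, already-quantized linear layer."""
    def __init__(self, quant_layer: nn.Module, in_features: int, out_features: int,
                 r: int = 8, alpha: int = 16):
        super().__init__()
        self.quant_layer = quant_layer        # frozen QuantLinear-style module
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)    # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # The quantized layer is only called, never inspected, so no
        # isinstance check on its type is needed here.
        return self.quant_layer(x) + self.lora_B(self.lora_A(x)) * self.scaling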

johnsmith0031 commented 1 year ago

Maybe it would be more convenient to use an independent plugin to monkeypatch everything related in the webui? And use an independent CUDA kernel, separate from the original GPTQ's, so as to avoid conflicts.
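
As a toy, self-contained illustration of the monkeypatch idea (the loader names below are made up; this is not the repo's actual code), the trick is just to swap a function on an already-imported webui module so every caller picks up the new behaviour without editing the webui itself:

import types

# Stand-in for an already-imported webui module such as modules.GPTQ_loader.
webui_loader = types.SimpleNamespace()

def load_quantized_stock(model_name):
    return f"stock GPTQ loader: {model_name}"

def load_quantized_4bit_lora(model_name):
    # A real plugin would call this repo's 4-bit loader and CUDA kernel here.
    return f"4-bit LoRA loader: {model_name}"

webui_loader.load_quantized = load_quantized_stock
webui_loader.load_quantized = load_quantized_4bit_lora  # the "monkey patch"

print(webui_loader.load_quantized("llama-7b-4bit"))      # routed to the patched loader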

oobabooga commented 1 year ago

I ended up going for an optional --monkey-patch flag that causes this repository to be imported and a custom loader to be used. Here is the PR if anyone wants to try it:

https://github.com/oobabooga/text-generation-webui/pull/1200

I have tested it with llama-7b and llama-30b and it seems to work. The output seems to be deterministic for some reason (no idea if that's a bug or not).

A caveat is that once the monkey patch is added, the PEFT library is overwritten in an irreversible way (if I understand correctly). I should add a warning about that.
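
For anyone wanting to try it, the invocation is the earlier command plus the new flag (the model and LoRA names are just the ones from the attempt above):

python server.py --model llama-7b-4bit-128g --lora tloen_alpaca-lora-7b --monkey-patch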

Ph0rk0z commented 1 year ago

It was easier for me to just integrate the patches into

GPTQ_loader: https://github.com/Ph0rk0z/text-generation-webui-testing/blob/DualModel/modules/GPTQ_loader.py
LoRA.py: https://github.com/Ph0rk0z/text-generation-webui-testing/blob/DualModel/modules/LoRA.py

This repo and LoRA work for more than llama if you genericize the functions: https://github.com/Ph0rk0z/GPTQ-Merged/blob/dual-model/src/alpaca_lora_4bit/autograd_4bit.py

I mean it works no problem and loads/unloads models from the UI without monkey patching or whatever.

You can use sterlind's PEFT or just patch the PEFT like the recent change. Maybe even the genericizing can be used for training any model in 4bit.
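
A rough sketch of what "genericizing" could look like (a hypothetical helper, not the actual autograd_4bit.py code): instead of hard-coding LLaMA layer names, walk any model and swap each nn.Linear for a quantized replacement supplied by the caller:

import torch.nn as nn

def replace_linears(module: nn.Module, make_quant_layer):
    """Recursively swap every nn.Linear for a caller-supplied quantized layer."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, make_quant_layer(child.in_features,
                                                   child.out_features,
                                                   bias=child.bias is not None))
        else:
            replace_linears(child, make_quant_layer)  # recurse into submodules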

ehartford commented 1 year ago

So we could use it to fine-tune Galactica? I always wanted to fine-tune Galactica to believe it was Rick Sanchez.

Ph0rk0z commented 1 year ago

Probably... but I've been too dumb and haven't had enough GPU time to play with the training yet. I converted other OPT models. I have all of Galactica 30B, and I was going to convert it to GPTQ and run it with this. Or try to.

oobabooga commented 1 year ago

@Ph0rk0z do you get deterministic outputs in your setup when you add a LoRA to a 4-bit model? I'll see if I can simplify my implementation based on your code.

My tests here seem to indicate that the LoRA is not working as intended, and I haven't been able to identify why yet: https://github.com/oobabooga/text-generation-webui/pull/1200#issuecomment-1509294931
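
One cheap, framework-agnostic sanity check for "is the adapter doing anything at all?" (toy code, not tied to the webui or peft): with zeroed LoRA weights the output must match the base layer exactly, and with non-zero weights it must not:

import torch
import torch.nn as nn

base = nn.Linear(16, 16, bias=False)
lora_A = nn.Linear(16, 4, bias=False)
lora_B = nn.Linear(4, 16, bias=False)
nn.init.zeros_(lora_B.weight)          # a freshly initialized adapter is a no-op

x = torch.randn(1, 16)
with torch.no_grad():
    assert torch.allclose(base(x), base(x) + lora_B(lora_A(x)))      # identical output
    lora_B.weight.fill_(0.05)                                        # pretend-trained adapter
    assert not torch.allclose(base(x), base(x) + lora_B(lora_A(x)))  # output must change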

Ph0rk0z commented 1 year ago

I used to get responses that were deterministic but different per LoRA... at least on llama. OPT was kind of weird. But the last time I fully tested was on the GPTQv1 implementation.

I also loaded models without lora and they were deterministic.