wywywywy opened this issue 1 year ago
Yes, I may make some pull requests in a few days.
Upstream we will be stuck re-quantizing and using group size.
A PR to https://github.com/huggingface/peft and https://github.com/qwopqwop200/GPTQ-for-LLaMa would be awesome.
@Ph0rk0z re-quantization is being worked on here: https://github.com/oobabooga/text-generation-webui/pull/530
If all goes right, the models created by @USBhost will be easier and faster to load and also more accurate than the current decapoda weights.
Does anyone have the requantized 4bit llama model weight? I'd like to have the code tested on that first
As @oobabooga has mentioned, @USBhost is in the process of cooking them right now!
Can't wait 😋
I tried the https://huggingface.co/ozcur/alpaca-native-4bit
oh so these ones are not requantized https://huggingface.co/maderix/llama-65b-4bit/tree/main
The new quantized weights by @USBhost are available: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-2-get-the-pre-converted-weights
This is how to use them:
python server.py --model llama-7b-4bit --wbits 4
python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
See the updated documentation here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
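For anyone wondering what --groupsize 128 actually means: the weights are quantized in contiguous groups of 128, each with its own scale, which is more accurate than one scale per whole row. A rough NumPy sketch of the idea, using plain round-to-nearest only, NOT GPTQ's actual error-compensating algorithm:

```python
# Illustrative sketch of group-wise quantization (what --groupsize
# controls).  Plain round-to-nearest per group; GPTQ's real algorithm
# additionally compensates quantization error across columns.
import numpy as np

def quantize_grouped(w, group_size=128, bits=4):
    """Quantize a 1-D weight vector with one scale/offset per group."""
    levels = 2 ** bits - 1                     # 15 levels for 4-bit
    out = np.empty_like(w)
    for start in range(0, len(w), group_size):
        g = w[start:start + group_size]
        lo, hi = float(g.min()), float(g.max())
        scale = (hi - lo) / levels or 1.0      # avoid division by zero
        q = np.clip(np.round((g - lo) / scale), 0, levels)
        out[start:start + group_size] = q * scale + lo  # dequantized value
    return out

w = np.random.randn(4096).astype(np.float32)
w_hat = quantize_grouped(w, group_size=128)
print(float(np.abs(w - w_hat).max()))  # worst-case error is about half a step
```

Smaller groups mean more scales to store but lower per-group error, which is the size/accuracy trade-off the 128g models make.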
FYI, I've submitted a series of PRs to get these changes moving upstream: https://github.com/johnsmith0031/alpaca_lora_4bit/pull/23
Any updates towards getting the patches merged upstream? There's quite a lot of community interest in getting 4bit training support generally available.
I'd love to get 4-bit LoRA training working in the webui (its trainer interface is currently limited to 8-bit), but I'm blocked by not wanting to integrate a pile of hack/patch scripts into the webui just to force that to work.
I mean it works how I have it. Not in some monkey patch: https://github.com/Ph0rk0z/text-generation-webui-testing/tree/DualModel
The problem of having to use sterlind's GPTQ remains though. I was going to cheat and edit the lora tab to get this feature unofficially.
Ooo, that's looking a lot closer. Ideally this should be turned into upstream PRs: @Curlypla's peft fork PR'd to main peft, and your GPTQ fork merged into GPTQ; then the webui changes can be merged into the webui, and, boom, a clean implementation of 4-bit LoRA for everyone :D If that's working properly, then all the code work is done and only filing the PRs is still needed.
I know, but nobody can agree.. that's why I did this. Plus, GPTQ keeps changing the model spec and putting in breaking changes where some cards become unsupported or performance drops.
Can I just edit the requirements.txt to use a custom fork of PEFT, then edit the code in https://github.com/oobabooga/GPTQ-for-LLaMa, and call it a day for using LoRAs on top of 4-bit models? People are uploading endless quantizations of models merged with LoRAs to Hugging Face, and that's wasteful when the LoRA itself is like a 20mb file.
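To put numbers on the "wasteful" point, here is back-of-the-envelope arithmetic for the adapter size, assuming a hypothetical rank-16 LoRA on only the q_proj/v_proj attention matrices of LLaMA-7B (the exact config varies per adapter):

```python
# Rough size of a LoRA adapter vs. re-uploading a whole merged model.
# Assumed (hypothetical) config: LLaMA-7B, rank-16 adapters on the
# q_proj and v_proj attention matrices only, stored in fp16.
hidden = 4096   # LLaMA-7B hidden size
layers = 32     # transformer blocks
rank = 16       # LoRA rank r
targets = 2     # q_proj and v_proj

# Each adapted d x d matrix adds A (r x d) and B (d x r): 2 * r * d params.
params = layers * targets * 2 * rank * hidden
size_mb = params * 2 / 1024 ** 2  # 2 bytes per fp16 parameter
print(params, round(size_mb, 1))  # ~8.4M params, about 16 MB
```

So an adapter on the order of tens of megabytes, versus multiple gigabytes for each re-uploaded 4-bit merge.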
Desperate time calls for desperate measures. It's a yes from me!
The peft change is small and straightforward and really should just be PR'd without issue. The GPTQ changes should at least be attempted upstream. But considering the peft fork author hasn't replied, and the GPTQ fork author did reply above with a refusal to try... yeah, applying the hacky forks for now seems like the best option until people get around to making the PRs.
There is already a PEFT repo, no need to fork it. It's the GPTQ that has all the changes and multiple ways to go.
I have tried simply doing
pip uninstall -y peft
pip install git+https://github.com/huggingface/peft.git@70af02a2bca5a63921790036b2c9430edf4037e2
pip install git+https://github.com/sterlind/GPTQ-for-LLaMa.git@lora_4bit
and then trying to start the web UI with
python server.py --model llama-7b-4bit-128g --lora tloen_alpaca-lora-7b
and that yielded the old error:

ValueError: Target module QuantLinear() is not supported. Currently, only torch.nn.Linear and Conv1D are supported.
So I guess changing the requirements is not enough, and I also need the monkey-patch code from @Ph0rk0z: https://github.com/johnsmith0031/alpaca_lora_4bit/blob/main/text-generation-webui/custom_monkey_patch.py, which itself depends on the remaining files in this repository. What is the minimal way of getting this working?
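For what it's worth, that ValueError fires because peft only knows how to wrap a fixed set of layer types, and the monkey patch works because Python lets you swap out a module's functions at runtime before they get called. A toy sketch of the general pattern — every name here is invented; this is NOT peft's real internals:

```python
# Toy model of the monkey-patch pattern: wrap a library's strict
# dispatch function so it accepts one more layer type.  All names
# are invented for illustration; this is not peft's actual code.
import types

fakelib = types.ModuleType("fakelib")  # stand-in for the third-party library

def _dispatch(layer_type):
    if layer_type not in ("Linear", "Conv1D"):
        raise ValueError(f"Target module {layer_type} is not supported.")
    return f"wrapped {layer_type}"

fakelib.dispatch = _dispatch

# The patch: keep the original and fall through to it for known types.
_orig_dispatch = fakelib.dispatch

def patched_dispatch(layer_type):
    if layer_type == "QuantLinear":
        return "wrapped QuantLinear"  # custom 4-bit handling would go here
    return _orig_dispatch(layer_type)

fakelib.dispatch = patched_dispatch

print(fakelib.dispatch("QuantLinear"))  # no more ValueError
```

The catch, as discussed in this thread, is that the patch has to run before the library's dispatch is used, which is why just swapping the installed packages isn't enough.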
For reference, these are the commands to clean up and go back to the previous requirements:
pip uninstall peft gptq-llama-0.2
pip install -r requirements.txt --upgrade
Maybe it is more convenient to use an independent plugin to monkey-patch everything related in the webui, and an independent CUDA kernel separate from the original GPTQ, so as to avoid conflicts?
I ended up going for an optional --monkey-patch flag that causes this repository to be imported and a custom loader to be used. Here is the PR if anyone wants to try it:
https://github.com/oobabooga/text-generation-webui/pull/1200
I have tested it with llama-7b and llama-30b and it seems to work. The output seems to be deterministic for some reason (no idea if that's a bug or not).
A caveat is that once the monkey patch is added, the PEFT library is overwritten in an irreversible way (if I understand correctly). I should add a warning about that.
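A side note on the "irreversible" part: a monkey patch is only irreversible if no reference to the original object is kept. A tiny generic Python demo, using json.dumps purely as a stand-in for the patched peft function:

```python
# Toy sketch: saving a reference to the original object lets a
# monkey patch be undone.  json.dumps is just a stand-in here,
# not the function the webui actually patches.
import json

_original_dumps = json.dumps          # keep the original

def noisy_dumps(obj, **kwargs):       # the "patch"
    return _original_dumps(obj, **kwargs).upper()

json.dumps = noisy_dumps              # patch applied process-wide
patched = json.dumps({"a": 1})

json.dumps = _original_dumps          # patch reverted
restored = json.dumps({"a": 1})
print(patched, restored)
```

If the patch replaces attributes without stashing the originals anywhere, then yes, the only way back within that process is to not apply it in the first place.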
It was easier for me to just integrate the patches into the webui modules directly:

GPTQ_loader.py: https://github.com/Ph0rk0z/text-generation-webui-testing/blob/DualModel/modules/GPTQ_loader.py
LoRA.py: https://github.com/Ph0rk0z/text-generation-webui-testing/blob/DualModel/modules/LoRA.py
This repo and LoRA work for more than LLaMA if you genericize the functions: https://github.com/Ph0rk0z/GPTQ-Merged/blob/dual-model/src/alpaca_lora_4bit/autograd_4bit.py
I mean it works no problem and loads/unloads models from the UI without monkey patching or whatever.
You can use sterlind's PEFT or just patch PEFT like the recent change. Maybe the genericizing can even be used for training any model in 4-bit.
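The "genericize" idea can be sketched roughly like this: instead of hard-coding LLaMA module names, walk any model and swap nn.Linear layers by type. Quant4Linear below is a hypothetical placeholder, not the real autograd_4bit implementation:

```python
# Sketch of architecture-agnostic layer replacement: recurse through
# any model and swap nn.Linear for a quantized stand-in by type.
# Quant4Linear is a hypothetical placeholder (a real one, e.g. GPTQ's
# QuantLinear, would hold packed 4-bit weights and dequantize in forward).
import torch.nn as nn

class Quant4Linear(nn.Module):
    """Placeholder for a real 4-bit linear layer."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.in_features = linear.in_features
        self.out_features = linear.out_features
    def forward(self, x):
        raise NotImplementedError("real code would dequantize and matmul here")

def replace_linear(module: nn.Module, skip=("lm_head",)):
    """Swap every nn.Linear (except skipped names) for Quant4Linear, recursively."""
    for name, child in module.named_children():
        if name in skip:
            continue
        if isinstance(child, nn.Linear):
            setattr(module, name, Quant4Linear(child))
        else:
            replace_linear(child, skip)

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
replace_linear(model)
print(model)  # both Linear layers swapped, no architecture-specific names needed
```

Because the swap keys on the layer type rather than on LLaMA-specific names like q_proj, the same routine would apply to OPT, Galactica, or anything else built from nn.Linear.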
So we could use it to fine tune Galactica? I always wanted to fine tune Galactica to believe it was Rick Sanchez
Probably.. but I've been too dumb and not had enough GPU time to play with the training yet. I converted other OPT models. I have all of Galactica 30B and I was going to convert it to GPTQ and run it with this. Or try to.
@Ph0rk0z do you get deterministic outputs in your setup when you add a LoRA to a 4-bit model? I'll see if I can simplify my implementation based on your code.
My tests here seem to indicate that the LoRA is not working as intended and I couldn't identify why yet https://github.com/oobabooga/text-generation-webui/pull/1200#issuecomment-1509294931
I used to get different, deterministic responses per LoRA.. at least on llama. OPT was kind of weird. But the last time I fully tested was the GPTQv1 implementation.
I also loaded models without lora and they were deterministic.
Now that this proof-of-concept seems to be functional, is it about time to think about merging the changes upstream to respective repos?
text-generation-webui and GPTQ shouldn't be a problem because they are very responsive and very open to new ideas & contributions. Not sure about peft though, as there's a company behind it.