This would be amazing!
I think GPTQ would be where lora support gets added, no?
Given this looks like the key addition from the alpaca lora code -
model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
This should be the next step.
After that we will need someone to come up with the textgen version of civitai :^)
My device is a GTX 1650 4GB, i5-12400, 40GB RAM.
I have set up llama-7b according to the wiki
I can run it with python server.py --listen --auto-devices --model llama-7b
and everything goes well!
But I can't run it with --load-in-8bit,
which, according to https://github.com/oobabooga/text-generation-webui/pull/366, is what I should use.
When I start with python server.py --listen --auto-devices --model llama-7b --load-in-8bit
there is no error and everything seems fine, BUT once I click the 'Generate' button in the web UI,
this error appears in the terminal:
(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00, 4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
error detected
Traceback (most recent call last):
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
@wk-mike I also have a GTX 1650 on my laptop and this error also happens to me when I try to use --load-in-8bit with it.
I have never been able to figure out the cause. You can start a new issue for this with the error message that you just posted, maybe someone else can help.
OK!
It works with the CPU:
python server.py --listen --cpu --model llama-7b --load-in-8bit
I tested it and it's OK.
Merged now
pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit
Then select the LoRA in the parameters tab. Alternatively, start the web UI with
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b
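Under the hood, the --lora flag attaches the adapter with peft, roughly like this (a sketch based on modules/LoRA.py as quoted later in this thread; shared.model is the already-loaded base model):
from pathlib import Path
from peft import PeftModel

# the downloaded adapter lives under loras/alpaca-lora-7b
shared.model = PeftModel.from_pretrained(shared.model, Path("loras/alpaca-lora-7b"))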
I can run it with the CPU, but I still get an error with the GPU:
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --cpu
good
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --auto-devices
not good
Here is the full output:
(textgen) wk:text-generation-webui$ python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --auto-devices
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00, 4.83it/s]
Loaded the model in 6.97 seconds.
alpaca-lora-7b
Adding the LoRA alpaca-lora-7b to the model...
Traceback (most recent call last):
File "/home/wk/data/text-generation-webui/server.py", line 240, in <module>
add_lora_to_model(shared.lora_name)
File "/home/wk/data/text-generation-webui/modules/LoRA.py", line 17, in add_lora_to_model
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 143, in from_pretrained
model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 514, in __init__
super().__init__(model, peft_config)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 79, in __init__
self.base_model = LoraModel(peft_config, model)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 118, in __init__
self._find_and_replace()
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 163, in _find_and_replace
new_module = Linear(target.in_features, target.out_features, bias=bias, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 293, in __init__
nn.Linear.__init__(self, in_features, out_features, **kwargs)
TypeError: Linear.__init__() got an unexpected keyword argument 'has_fp16_weights'
It's impressive that this works in CPU mode at all, given that it doesn't seem to work in GPU mode without --load-in-8bit at the moment.
I can run it with the CPU, but still get an error with the GPU: python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b
Hi, did you find any solution for this? I'm having the same issue.
Merged now
pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit
Then select the LoRA in the parameters tab. Alternatively, start the web UI with
python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b
Hm, I did exactly this and I get
server.py: error: unrecognized arguments: --lora alpaca-lora-7b
EDIT: I'm stupid. Forgot to update with git pull. But now I get this error and can't start the web UI even without --lora:
Traceback (most recent call last):
File "J:\LLaMA\text-generation-webui\server.py", line 13, in <module>
import modules.chat as chat
File "J:\LLaMA\text-generation-webui\modules\chat.py", line 14, in <module>
from modules.html_generator import fix_newlines, generate_chat_html
File "J:\LLaMA\text-generation-webui\modules\html_generator.py", line 11, in <module>
import markdown
ModuleNotFoundError: No module named 'markdown'
Run pip install -r requirements.txt
Run
pip install -r requirements.txt
I did that. Had to do the 8-bit fix all over again after that, and then something else broke, and I was so frustrated that I deleted everything and am trying a fresh installation now...
Try this, it worked for me:
https://github.com/oobabooga/text-generation-webui/issues/400#issuecomment-1474876859
Hey!
I made the LoRA work in 4 bits: python server.py --model llama-7b --gptq-bits 4 --cai-chat
I changed the lora.py from this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py
Here's the modified version (I don't know how to put files on GitHub, so I'll put a link): https://pastebin.com/eUWZsirk
I added these 2 instructions in the _find_and_replace() method:
1) new_module = None # Add this line to initialize the new_module variable
2) if new_module is None: continue
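For anyone curious what those two lines do, here is a minimal, self-contained illustration of the same guard pattern (this is an illustration, not the actual peft code): the loop only builds a replacement for plain nn.Linear layers and silently skips anything it could not wrap (e.g. quantized layers), instead of failing partway through the replacement pass.
import torch.nn as nn

def find_and_replace_sketch(model, wrap):
    # wrap(linear) returns a replacement module, or None if the layer is unsupported
    for name, target in list(model.named_modules()):
        new_module = None                     # 1) initialize so the check below is always defined
        if name and isinstance(target, nn.Linear):
            new_module = wrap(target)
        if new_module is None:                # 2) skip layers that were not (or could not be) wrapped
            continue
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, new_module)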
@BadisG I am not sure if this is really working. Here is a test
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Debug-deterministic
https://huggingface.co/chansung/alpaca-lora-13b
python server.py --load-in-8bit --model llama-13b-hf --listen --lora alpaca-lora-13b
Transformers, the Python library,
Can help you with your data science.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
python server.py --gptq-bits 4 --model llama-13b-hf --listen --lora alpaca-lora-13b
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
python server.py --gptq-bits 4 --model llama-13b-hf --listen
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library.
### Response:
Write a poem about the transformers Python library.
@BadisG I am not sure if this is really working. Here is a test
Are you sure this is the right way to do it? Tbh I'm not a specialist on this at all, but on llama.cpp you have a seed you can reuse to get the same result every time, no matter the Generation parameters preset.
If you have something like this in your code, maybe you could consider testing it that way. Either the "Debug-deterministic" preset is way too restrictive and a simple LoRA can't change anything, or my fix wasn't good enough...
EDIT: The LoRA works with a random Generation parameters preset. When I use (NovelAI-Sphinx Moth) and disable "do_sample", it gives the same answer every time:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
### Response::
The transformer is a robot that can change from one vehicle to another. It has a red body, blue head and yellow arms. The transformer's name is Optimus Prime. He is a leader of the Autobots. His main weapon is his sword. He also has a gun called "the power". He can fly in space or on land. He can go...
When I add the Lora I got this:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library.
### Response::
The transformer is a machine learning algorithm that can be used to classify data into different categories, such as cars and trucks. The transformer is based on the idea of neural networks. Neural Networks are a type of artificial intelligence (AI) that uses deep learning to learn from examples. Deep Learning is a branch of AI that learns...
This is what I got from ChatGPT about "do_sample = False":
"if you use do_sample=False, the model uses greedy decoding to generate text, consistently choosing the word with the highest probability. In this case, the text generation process is deterministic, and the use of a seed does not have a significant effect on the results."
In summary, if you want reproducible results, just use do_sample = False and you can choose any Generation parameters preset you want.
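A minimal sketch of that point with plain transformers (the local model path and the generation length are assumptions):
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b")
inputs = tokenizer("Write a poem about the transformers Python library.", return_tensors="pt")
# do_sample=False means greedy decoding: the highest-probability token is picked at every step,
# so repeated runs produce identical text regardless of any seed.
output = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))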
Boss, there is this comment for the 4bit don't know if you saw this already https://github.com/oobabooga/text-generation-webui/issues/332#issuecomment-1474883977 I am in the process of trying it myself
Lora 100% is supposed to make it deterministic: https://github.com/oobabooga/text-generation-webui/issues/419
If it is not then the lora isn't working.
@Ph0rk0z does that make sense? Why would there be no sampling when a LoRA is in use?
Lora 100% is supposed to make it deterministic: #419
If it is not then the lora isn't working.
The presence of a LoRA does not alter the deterministic aspect of your model. Regardless of whether you have a LoRA or not, you can always control the reproducibility of your outputs by adjusting the seed or enabling/disabling the "do_sample" feature.
Well 4 bit by itself is deterministic. 8/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample allows 8bit to generate without int8 threshold parameter for me.. but text never appeared. So I think that 4bit lora is going to be suspect, especially without do_sample.
about greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short it is :(
Well 4 bit by itself is deterministic. 8/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample allows 8bit to generate without int8 threshold parameter for me.. but text never appeared. So I think that 4bit lora is going to be suspect, especially without do_sample.
about greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short it is :(
when I put "do_sample = False" and I generate 10 times the text with Lora, I got 10 times the same result ("Text LORA" 10 times). The result is exactly the same when I generate 10 times the text without Lora ("Text NO LORA" 10 times)
But of course "Text LORA" and "Text NO LORA" are different to each other, that's the point of a Lora, to give you something different compared to the raw model
Yes.. but do_sample = False
generations are repetitive garbage, and you use (NovelAI-Sphinx Moth) in your example. With randomness enabled in the generation parameters, you could avoid the problems I experienced, for a while, too. I only really saw what that debug preset means when I started using it.
The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't handle anything but greedy decoding.
Also, another question, because I have only 1.5 brain cells: do things like top_p and temperature even do anything without do_sample?
Do things like top_p, and temperature even do anything without do sample?
No they don't; without do_sample it's plain greedy decoding.
Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b
Yes.. but
do_sample = False
generations are repetitive garbage, and you use (NovelAI-Sphinx Moth) in your example. With randomness enabled in the generation parameters, you could avoid the problems I experienced, for a while, too. I only really saw what that debug preset means when I started using it. The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't handle anything but greedy decoding.
But your "debug preset" also has do_sample = False, that's exactly why it that makes it as a debug preset actually.
The best way to see the reproducibility of an output is to just fix the seed.
On llama.cpp we can do that:
SEED = 1 (Always the same output for a fixed seed)
SEED = 2 (Always the same output for a fixed seed)
Like that you can have (do_sample = True) + a fixed seed = a good result that will always be the same = perfect reproducibility.
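For what it's worth, the transformers side has the same mechanism; a sketch (the model, tokenizer and inputs are assumed to be loaded as in the earlier sketch, and the sampling values are arbitrary):
from transformers import set_seed

set_seed(1)  # seeds Python, NumPy and torch/CUDA RNGs in one call
output = model.generate(**inputs, do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=64)
# calling set_seed(1) again before the next generate() reproduces the exact same text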
Do things like top_p, and temperature even do anything without do sample?
No they don't; without do_sample it's plain greedy decoding.
Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b
A6000 48gb? Running it 4bit like he did? Gotta test all and see.
Is there something I need to do to support LoRA in a multi-GPU configuration?
Who knows if it supports multi-GPU. But what I did find is that the LoRA loads in 8-bit using bitsandbytes. I ran my test again and the output is the same as regular GPTQ llama 13b.
Btw, be careful about pulling the new GPTQ. It's broken right now.
I think I'm running into this bug https://github.com/huggingface/peft/issues/115#issuecomment-1460706852
Looks like I may need to modify PeftModel.from_pretrained or PeftModelForCausalLM but I'm not sure where...
I think something is broken for int8 split-model lora right now... but not sure where to fix... I think this guy did it... https://github.com/huggingface/peft/issues/115#issuecomment-1441016348
I found a really hacky fix...
I kept running OOM as the model loads lopsided... so I made the following changes to the modules/LoRA.py file (see the sketch below):
1) replace params['device_map'] = {'': 0}
with #params['device_map'] = {'': 0}
2) add params['max_memory'] = {0: "16GiB", 1: "25GiB"}
just below it.
Note: replace 16GiB and 25GiB with whatever values you're passing to server.py as the --gpu-memory launch parameter.
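For reference, a sketch of what the relevant lines in modules/LoRA.py end up looking like after this change (the memory values are examples and the surrounding code is abridged):
# inside add_lora_to_model() in modules/LoRA.py (abridged)
# params['device_map'] = {'': 0}                  # 1) original line, commented out
params['max_memory'] = {0: "16GiB", 1: "25GiB"}   # 2) match your per-GPU --gpu-memory values
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)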
I've got a new error somehow during the loading of the 13b lora
CUDA SETUP: Loading binary C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Adding the LoRA alpaca-lora-13b to the model...
Traceback (most recent call last):
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\routes.py", line 374, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 1017, in process_api
result = await self.call_function(
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 835, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
result = context.run(func, *args)
File "D:\Large Language Models\text-generation-webui\server.py", line 73, in load_lora_wrapper
add_lora_to_model(selected_lora)
File "D:\Large Language Models\text-generation-webui\modules\LoRA.py", line 22, in add_lora_to_model
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\peft_model.py", line 167, in from_pretrained
max_memory = get_balanced_memory(
File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py", line 452, in get_balanced_memory
per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero
I fixed it by changing the modeling.py file in this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py
On line 452, you replace this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
with this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices) if num_devices != 0 else 0
@BadisG I ran into that several times during testing as well, but I never tried to solve it because I presumed it was a legitimate error with my GPU not being found. Thank you for all your work, from someone who knows nothing about Python. It seems it's helped others, too.
@BadisG I ran into that several times during testing as well, but I never tried to solve it because I presumed it was a legitimate error with my GPU not being found. Thank you for all your work, from someone who knows nothing about Python. It seems it's helped others, too.
I know how to code in Python; when I said "I'm not a specialist on it at all", that's because I'm just a Data Scientist, not a Machine Learning guy 😄
But I'm glad it helped you and the others as well; it's a pleasure to contribute to the project. :)
This fork looks like it fixes lora properly for 4-bit. https://github.com/Curlypla/peft-GPTQ
This fork looks like it fixes lora properly for 4-bit. https://github.com/Curlypla/peft-GPTQ
This is the work of https://github.com/johnsmith0031/alpaca_lora_4bit, and it is only for training in 4-bit, because it requires a re-quantized LLaMA. (For inference, I advise just doing what BadisG said.)
What BadisG did to load the lora for inference has no effect on 4-bit GPTQ models. I get same response as default GPTQ.
Yes, I had the same thing, but I thought it was just me that had a problem. So yes, maybe it will work, but you still have to re-quantize LLaMA.
So you mean re-quantize the LoRA? Because the llama model itself already loads in 4 bits, but the LoRA is loading in 8.
What BadisG did to load the lora for inference has no effect on 4-bit GPTQ models. I get same response as default GPTQ.
I believe that we can arrive at the correct conclusion for all of this once we have the opportunity to manipulate the seed, allowing us to conduct thorough testing.
https://github.com/oobabooga/text-generation-webui/issues/463
So with @BadisG's fix, the _find_and_replace function in lora.py of peft now does nothing with GPTQ, because GPTQ 4-bit models have no Linear layers (they've been packed into QuantLinear layers).
Are we saying we don't actually need _find_and_replace for inference?
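A quick way to check that on whatever model is loaded (a sketch; pass in the webui's shared.model or any other torch module):
import torch.nn as nn

def count_plain_linears(model):
    # a fully packed GPTQ model reports 0 here, so peft's Linear replacement path never fires
    return sum(isinstance(m, nn.Linear) for m in model.modules())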
True, it would help. Right now the behavior for me is as shown here when I load the model in 8-bit. The LoRA repairs it thusly.
The reply is quite different for the exact same deterministic prompt. The LoRA'd GPTQ, however, replies just like plain GPTQ, and in theory it should not. The only unfair part is that I used the 13b instead of the 7b. I need to put int8_threshold back into models.py and compare the 7b. Maybe try a few prompts.
Are we saying we don't actually need _find_and_replace for inference?
Oh snap.
What BadisG did to load the lora for inference has no effect on 4-bit GPTQ models. I get same response as default GPTQ.
I believe that we can arrive at the correct conclusion for all of this once we have the opportunity to manipulate the seed, allowing us to conduct thorough testing.
463
Adding torch.manual_seed(seed) (seed representing your seed; for me it's "0" in this example) just before it calls the model to generate seems to work for me.
It generates the same response over and over, however for some reason it does generate a different one occasionally. I suspect it's from me pressing the stop button mid-generation?
Nonetheless, even if you get one of the different responses that show up, they are the same ones over and over as well; the same 2-3 different ones. It's enough that I can test BadisG's LoRA fix, which seems to have no effect on the generation at all in my testing, meaning the LoRA isn't properly loaded.
Here's some info if you want to try and match mine. I'm not sure what all could affect it, so you may not have luck matching with me, but I'll provide it anyways.
First of all, I added this bit of code around line 202 of text_generation.py. It should go after clear_torch_cache(), which is not a good solution overall, but it's exactly before where model.generate() is called in my setup (and is good enough for me to test it). You may need to find the right line for yours if it's something else like no-stream, etc. I imagine someone can figure out an actual decent place to put this that will affect all of those! 😓
seed = 0
torch.manual_seed(seed)
print("seed: " + str(seed))
I print my seed so that I know it's being set when I click generate; you can also add whatever random function you like to generate a random seed.
Here are my parameters, which will normally generate random things without my seed:
Here's what it generates using 7b at 4bits, just using the default prompt.
@Arargd Don't we need torch.cuda.manual_seed() as well?
seed = 12
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
print("seed: " + str(seed))
@Arargd Don't we need torch.cuda.manual_seed() as well?
seed = 12
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
print("seed: " + str(seed))
I suppose, if it makes a difference for people's systems. With/without it, it seems to generate the same for me. Mine's just hackily patched in, so a better solution overall is probably needed for that matter.
I put the above code in, now my GPTQ 4-bit outputs the same with and without Lora. Is my testing method flawed?
Generation using seed 12, 13b weights with alpaca13B-lora.
Prompt
Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017
### Response:
Parameters
do_sample=True
temperature=0.36
top_p=1
typical_p=1
repetition_penalty=1.23
top_k=12
num_beams=1
penalty_alpha=0
min_length=0
length_penalty=1
no_repeat_ngram_size=0
early_stopping=False
Output
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017
### Response::
The authors of this article proposed a new model for attention called Attention Is All You Need (AITAN). The AITAN model has three components: a multi-headed encoder and decoder with self-attentions to capture long range dependencies across input sequences; a softmax classifier which predicts labels from hidden states; and a single linear layer on top of the output of the last head to compute contextualized representations of tokens. They also used the transformer architecture as their base network because it can be trained efficiently using gradient descent methods such as Adam or stochastic gradient decent. In addition, they have shown how to use the Transformer architecture for sequence tagging tasks like named entity recognition and part-of-speech tagging. Finally, the authors have demonstrated that their approach outperforms previous state-of-the art models for these two tasks.
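For reference, that parameter list corresponds roughly to a transformers generate() call like this (a sketch; max_new_tokens and the already-loaded model and input_ids are assumptions):
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.36,
    top_p=1,
    typical_p=1,
    repetition_penalty=1.23,
    top_k=12,
    num_beams=1,
    penalty_alpha=0,
    min_length=0,
    length_penalty=1,
    no_repeat_ngram_size=0,
    early_stopping=False,
    max_new_tokens=200,  # assumption: not listed above
)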
@BadisG
Huzzah, I managed to get the exact same result with your first input there. The reason your LoRA output changed on the second one is that you added a line to your prompt there, "Mention the word "large language models" in that poem.", which is omitted from the first one.
https://github.com/tloen/alpaca-lora
This repo got LLaMA-7B working with a LoRA trained on the Alpaca JSON file. There is also a notebook with code.