oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Add lora support? #332

Closed 7Tenku closed 1 year ago

7Tenku commented 1 year ago

https://github.com/tloen/alpaca-lora

This repo got LLaMA-7B working with a LoRA trained on the Alpaca JSON file. There is also a notebook with code.

7Tenku commented 1 year ago

https://huggingface.co/tloen/alpaca-lora-7b

lolxdmainkaisemaanlu commented 1 year ago

This would be amazing!

fblissjr commented 1 year ago

I think GPTQ would be where lora support gets added, no?

Given this looks like the key addition from the alpaca lora code -

model = LLaMAForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")
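
A rough, self-contained version of that snippet (a sketch, not the web UI's code; it uses the current transformers class names LlamaForCausalLM / LlamaTokenizer instead of the early LLaMAForCausalLM spelling, and assumes bitsandbytes plus a CUDA GPU are available):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,   # needs bitsandbytes and a supported GPU
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")  # attach the adapter

prompt = "### Instruction:\nTell me about alpacas.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))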

oobabooga commented 1 year ago

This should be the next step.

After that we will need someone to come up with the textgen version of civitai :^)

oobabooga commented 1 year ago

WIP here: https://github.com/oobabooga/text-generation-webui/pull/366

wk-mike commented 1 year ago

My device is a GTX 1650 4GB, i5-12400, 40GB RAM.

I have set up llama-7b according to the wiki.

I can run it with python server.py --listen --auto-devices --model llama-7b and everything goes well!

But I can't run with --load-in-8bit. According to https://github.com/oobabooga/text-generation-webui/pull/366 I should use it. When I start with python server.py --listen --auto-devices --model llama-7b --load-in-8bit there is no error and everything seems fine, BUT once I open the web UI and click the 'Generate' button,

the error comes in the terminal:

(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
Traceback (most recent call last):
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
oobabooga commented 1 year ago

@wk-mike I also have a GTX 1650 on my laptop and this error also happens to me when I try to use --load-in-8bit with it.

I have never been able to figure out the cause. You can start a new issue for this with the error message that you just posted, maybe someone else can help.

wk-mike commented 1 year ago

OK!

It works with the CPU: python server.py --listen --cpu --model llama-7b --load-in-8bit. I tested it and it's OK.

oobabooga commented 1 year ago

Merged now

pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit

Then select the LoRA in the parameters tab. Alternatively, start the web UI with

python server.py --listen --model llama-7b --load-in-8bit  --lora alpaca-lora-7b
wk-mike commented 1 year ago

I can run it with the CPU, but still get an error with the GPU.

python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --cpu: good

python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b --auto-devices: not good, with

(textgen) wk:text-generation-webui$ python server.py --listen --model llama-7b --load-in-8bit  --lora alpaca-lora-7b --auto-devices

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.83it/s]
Loaded the model in 6.97 seconds.
alpaca-lora-7b
Adding the LoRA alpaca-lora-7b to the model...
Traceback (most recent call last):
  File "/home/wk/data/text-generation-webui/server.py", line 240, in <module>
    add_lora_to_model(shared.lora_name)
  File "/home/wk/data/text-generation-webui/modules/LoRA.py", line 17, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"))
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 143, in from_pretrained
    model = MODEL_TYPE_TO_PEFT_MODEL_MAPPING[config.task_type](model, config)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 514, in __init__
    super().__init__(model, peft_config)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 79, in __init__
    self.base_model = LoraModel(peft_config, model)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 118, in __init__
    self._find_and_replace()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 163, in _find_and_replace
    new_module = Linear(target.in_features, target.out_features, bias=bias, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 293, in __init__
    nn.Linear.__init__(self, in_features, out_features, **kwargs)
TypeError: Linear.__init__() got an unexpected keyword argument 'has_fp16_weights'
oobabooga commented 1 year ago

It's impressive that this works in CPU mode at all, given that it doesn't seem to work in GPU mode without --load-in-8bit at the moment.

athu16 commented 1 year ago

I can run it with the CPU, but still get an error with the GPU. python server.py --listen --model llama-7b --load-in-8bit --lora alpaca-lora-7b

Hi, did you find any solution for this? I'm having the same issue.

patrickmros commented 1 year ago

Merged now

pip install -r requirements.txt
python download-model.py tloen/alpaca-lora-7b
python server.py --model llama-7b --load-in-8bit

Then select the LoRA in the parameters tab. Alternatively, start the web UI with

python server.py --listen --model llama-7b --load-in-8bit  --lora alpaca-lora-7b

Hm, I did exactly this and I get

server.py: error: unrecognized arguments: --lora alpaca-lora-7b

EDIT: I'm stupid, I forgot to update with git pull. But now I get this error and can't start the web UI even without --lora:

Traceback (most recent call last):
  File "J:\LLaMA\text-generation-webui\server.py", line 13, in <module>
    import modules.chat as chat
  File "J:\LLaMA\text-generation-webui\modules\chat.py", line 14, in <module>
    from modules.html_generator import fix_newlines, generate_chat_html
  File "J:\LLaMA\text-generation-webui\modules\html_generator.py", line 11, in <module>
    import markdown
ModuleNotFoundError: No module named 'markdown'

oobabooga commented 1 year ago

Run pip install -r requirements.txt

patrickmros commented 1 year ago

Run pip install -r requirements.txt

I did that. I had to do the 8-bit fix all over again after that, then something else broke, and I was so frustrated that I deleted everything and am trying a fresh installation now...

oobabooga commented 1 year ago

Try this, it worked for me:

https://github.com/oobabooga/text-generation-webui/issues/400#issuecomment-1474876859

BadisG commented 1 year ago

Hey!

I made the LoRA work in 4-bit: python server.py --model llama-7b --gptq-bits 4 --cai-chat

I changed the lora.py from this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py

Here's the modified version (I don't know how to put files on GitHub, so I'll post a link): https://pastebin.com/eUWZsirk

I added these 2 instructions in the _find_and_replace() method:

1) new_module = None  # Add this line to initialize the new_module variable

2) if new_module is None: continue
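
Put differently, here is a minimal, self-contained sketch of that guard pattern (with made-up helper and module names, not the actual peft source): only build a replacement for layer types the loop knows how to wrap, and skip anything else, such as GPTQ's packed QuantLinear layers, instead of crashing.

import torch.nn as nn

def find_and_replace(model, target_suffixes=("q_proj", "v_proj")):
    for name, module in list(model.named_modules()):
        if not any(name.endswith(suffix) for suffix in target_suffixes):
            continue
        new_module = None  # 1) initialize, so unsupported layer types fall through
        if isinstance(module, nn.Linear):
            # stand-in for building the LoRA-wrapped replacement layer
            new_module = nn.Linear(module.in_features, module.out_features,
                                   bias=module.bias is not None)
        if new_module is None:  # 2) e.g. a GPTQ QuantLinear: skip it instead of erroring
            continue
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, new_module)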

oobabooga commented 1 year ago

@BadisG I am not sure if this is really working. Here is a test

Prompt

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library. 
### Response:

Preset

Debug-deterministic

LoRA

https://huggingface.co/chansung/alpaca-lora-13b

8-bit mode results

python server.py --load-in-8bit --model llama-13b-hf --listen --lora alpaca-lora-13b
Transformers, the Python library,
Can help you with your data science.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.
It can be used to create models,
And to transform data in a variety of ways.

4-bit mode results

python server.py --gptq-bits 4 --model llama-13b-hf --listen --lora alpaca-lora-13b
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.

4-bit mode results without any LoRA

python server.py --gptq-bits 4 --model llama-13b-hf --listen
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.
### Instruction:
Write a poem about the transformers Python library. 
### Response:
Write a poem about the transformers Python library.
BadisG commented 1 year ago

@BadisG I am not sure if this is really working. Here is a test

Are you sure this is the right way to test it? TBH I'm not a specialist at all, but on llama.cpp you have a seed you can reuse to get the same result every time, no matter the generation parameters preset.

If you have something like this in your code, maybe you could test it that way. Either the "Debug-deterministic" preset is so restrictive that a simple LoRA can't change anything, or my fix wasn't good enough...

EDIT: The LoRA works with a regular generation parameters preset. When I use NovelAI-Sphinx Moth and disable "do_sample", it gives the same answer every time:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library. 
### Response::
The transformer is a robot that can change from one vehicle to another. It has a red body, blue head and yellow arms. The transformer's name is Optimus Prime. He is a leader of the Autobots. His main weapon is his sword. He also has a gun called "the power". He can fly in space or on land. He can go...

When I add the LoRA, I get this:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a poem about the transformers Python library. 
### Response::
The transformer is a machine learning algorithm that can be used to classify data into different categories, such as cars and trucks. The transformer is based on the idea of neural networks. Neural Networks are a type of artificial intelligence (AI) that uses deep learning to learn from examples. Deep Learning is a branch of AI that learns...

This is what I got from ChatGPT about do_sample = False:

"if you use do_sample=False, the model uses greedy decoding to generate text, consistently choosing the word with the highest probability. In this case, the text generation process is deterministic, and the use of a seed does not have a significant effect on the results."

In summary, if you want reproducible results, just use do_sample = False and you can choose any generation parameters preset you want.
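
As a quick illustration of that point (a minimal sketch using gpt2 purely as a small stand-in model, not LLaMA or the web UI):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Write a poem about the transformers Python library.", return_tensors="pt")
# With do_sample=False, generate() falls back to greedy decoding, so the output
# is the same on every call regardless of temperature/top_p settings.
out1 = model.generate(**inputs, max_new_tokens=40, do_sample=False)
out2 = model.generate(**inputs, max_new_tokens=40, do_sample=False)
assert (out1 == out2).all()
print(tokenizer.decode(out1[0], skip_special_tokens=True))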

if-ai commented 1 year ago

Boss, there is this comment for 4-bit; I don't know if you saw it already: https://github.com/oobabooga/text-generation-webui/issues/332#issuecomment-1474883977. I am in the process of trying it myself.

Ph0rk0z commented 1 year ago

Lora 100% is supposed to make it deterministic: https://github.com/oobabooga/text-generation-webui/issues/419

If it is not then the lora isn't working.

oobabooga commented 1 year ago

@Ph0rk0z does that make sense? Why would there be no sampling when a LoRA is in use?

BadisG commented 1 year ago

Lora 100% is supposed to make it deterministic: #419

If it is not then the lora isn't working.

The presence of Lora does not alter the deterministic aspect of your model. Regardless of whether you have Lora or not, you can always modify the reproducibility of your outcomes by adjusting the seed or enabling/disabling the "do_sample" feature.

Ph0rk0z commented 1 year ago

Well 4 bit by itself is deterministic. 8/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample allows 8bit to generate without int8 threshold parameter for me.. but text never appeared. So I think that 4bit lora is going to be suspect, especially without do_sample.

about greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short it is :(

BadisG commented 1 year ago

Well 4 bit by itself is deterministic. 8/fp16 was not, unless you count producing a stream of unending garbage every time as deterministic. Turning off do_sample allows 8bit to generate without int8 threshold parameter for me.. but text never appeared. So I think that 4bit lora is going to be suspect, especially without do_sample.

about greedy decoding: https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d In short it is :(

when I put "do_sample = False" and I generate 10 times the text with Lora, I got 10 times the same result ("Text LORA" 10 times). The result is exactly the same when I generate 10 times the text without Lora ("Text NO LORA" 10 times)

But of course "Text LORA" and "Text NO LORA" are different to each other, that's the point of a Lora, to give you something different compared to the raw model

Ph0rk0z commented 1 year ago

Yes, but do_sample = False generations are repetitive garbage, and you used NovelAI-Sphinx Moth in your example. With randomness-enabled generation parameters you can avoid problems like the ones I ran into, for a while, too. I only really understood what that debug preset is for once I started using it.

The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't handle anything but greedy decoding.

Also, another question, because I have only 1.5 brain cells: do things like top_p and temperature even do anything without do_sample?

oobabooga commented 1 year ago

Do things like top_p and temperature even do anything without do_sample?

No they don't; without do_sample, generation is just greedy decoding.

Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b

BadisG commented 1 year ago

Yes, but do_sample = False generations are repetitive garbage, and you used NovelAI-Sphinx Moth in your example. With randomness-enabled generation parameters you can avoid problems like the ones I ran into, for a while, too. I only really understood what that debug preset is for once I started using it.

The point of that preset is to be restrictive. Nobody is saying you can't keep using it like this, but it still looks broken if it can't handle anything but greedy decoding.

But your "debug preset" also has do_sample = False, that's exactly why it that makes it as a debug preset actually.

The best way to see the reproducibility of an output is to just fix the seed.

On llama.cpp we can do that:

SEED = 1 (always the same output for a fixed seed)

SEED = 2 (always the same output for a fixed seed)

That way you can have do_sample = True + a fixed seed = a good result that will always be the same = perfect reproducibility.
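
The same thing can be approximated with plain transformers by seeding torch before each call (again a sketch with gpt2 as a stand-in; the generation settings are arbitrary examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Below is an instruction that describes a task.", return_tensors="pt")

def generate_with_seed(seed):
    torch.manual_seed(seed)  # fix the sampler's RNG before each call
    return model.generate(**inputs, max_new_tokens=30, do_sample=True,
                          top_p=0.9, temperature=0.7)

run_1 = generate_with_seed(1)
run_2 = generate_with_seed(1)
assert (run_1 == run_2).all()  # sampling is on, but a fixed seed reproduces the output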

Ph0rk0z commented 1 year ago

Do things like top_p and temperature even do anything without do_sample?

No they don't; without do_sample, generation is just greedy decoding.

Back to the original point: I see people claiming to use this 30b LoRA. How? https://huggingface.co/chansung/alpaca-lora-30b

A6000 48gb? Running it 4bit like he did? Gotta test all and see.

generic-username0718 commented 1 year ago

Is there something I need to do to get LoRA working in a multi-GPU configuration?

image

Ph0rk0z commented 1 year ago

Who knows if it supports multi-GPU. But what I did find is that the LoRA loads in 8-bit using bitsandbytes. I ran my test again and the output is the same as regular GPTQ llama 13b.

nolora

Btw, be careful of pulling new GPTQ. It's broken right now.

generic-username0718 commented 1 year ago

I think I'm running into this bug https://github.com/huggingface/peft/issues/115#issuecomment-1460706852

Looks like I may need to modify PeftModel.from_pretrained or PeftModelForCausalLM but I'm not sure where...

generic-username0718 commented 1 year ago

I think something is broken for int8 split-model lora right now... but not sure where to fix... I think this guy did it... https://github.com/huggingface/peft/issues/115#issuecomment-1441016348

generic-username0718 commented 1 year ago

I found a really hacky fix...

I kept running OOM because the model loads lopsided... so I made the following changes to the modules/LoRA.py file:

1) Replace params['device_map'] = {'': 0} with #params['device_map'] = {'': 0}

2) Add params['max_memory'] = {0: "16GiB", 1: "25GiB"} just below it.

Note: replace 16GiB and 25GiB with whatever you're passing to server.py as the --gpu-memory value. See the sketch below.

image
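
For reference, the edited section of modules/LoRA.py would then look roughly like this (a sketch of the change described above, not the project's exact code; the GiB values are examples and should mirror your --gpu-memory settings):

# params['device_map'] = {'': 0}                  # 1) original line, commented out
params['max_memory'] = {0: "16GiB", 1: "25GiB"}   # 2) let accelerate split across both GPUs
shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)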

BadisG commented 1 year ago

I've somehow got a new error during the loading of the 13b LoRA:

CUDA SETUP: Loading binary C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Adding the LoRA alpaca-lora-13b to the model...
Traceback (most recent call last):
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\routes.py", line 374, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 1017, in process_api
    result = await self.call_function(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\gradio\blocks.py", line 835, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "D:\Large Language Models\text-generation-webui\server.py", line 73, in load_lora_wrapper
    add_lora_to_model(selected_lora)
  File "D:\Large Language Models\text-generation-webui\modules\LoRA.py", line 22, in add_lora_to_model
    shared.model = PeftModel.from_pretrained(shared.model, Path(f"loras/{lora_name}"), **params)
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\peft\peft_model.py", line 167, in from_pretrained
    max_memory = get_balanced_memory(
  File "C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py", line 452, in get_balanced_memory
    per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
ZeroDivisionError: integer division or modulo by zero

I fixed it by changing the modeling.py file in this package: C:\Users\Utilisateur\anaconda3\envs\textgen\lib\site-packages\accelerate\utils\modeling.py

On line 452, you replace this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices)

with this: per_gpu = module_sizes[""] // (num_devices - 1 if low_zero else num_devices) if num_devices != 0 else 0
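
The guard only matters when accelerate computes zero usable GPU devices; in isolation, the patched expression behaves like this (illustrative values, not the real surrounding accelerate code):

num_devices = 0   # what accelerate computed here, hence the original ZeroDivisionError
low_zero = False
module_sizes = {"": 7_000_000_000}

per_gpu = (
    module_sizes[""] // (num_devices - 1 if low_zero else num_devices)
    if num_devices != 0
    else 0
)  # 0 instead of a crash when no GPU is visible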

ndkling commented 1 year ago

@BadisG I ran into that several times during testing as well, but I never tried to solve it because I presumed it was a legitimate error with my GPU not being found. Thank you for all your work, from someone who knows nothing about Python. It seems it's helped others, too.

BadisG commented 1 year ago

@BadisG I ran into that several times during testing as well, but I never tried to solve it because I presumed it was a legitimate error with my GPU not being found. Thank you for all your work, from someone who knows nothing about Python. It seems it's helped others, too.

I do know how to code in Python; when I said "I'm not a specialist on it at all", that's because I'm just a data scientist, not a machine learning guy 😄

But I'm glad it helped you and the others as well; it's a pleasure to contribute to the project. :)

Ph0rk0z commented 1 year ago

This fork looks like it fixes lora properly for 4-bit. https://github.com/Curlypla/peft-GPTQ

Snowad14 commented 1 year ago

This fork looks like it fixes lora properly for 4-bit. https://github.com/Curlypla/peft-GPTQ

This is the work of https://github.com/johnsmith0031/alpaca_lora_4bit, and it is only for training in 4-bit, because it requires a re-quantized LLaMA (for inference, I advise just doing what BadisG said).

Ph0rk0z commented 1 year ago

What BadisG did to load the lora for inference has no effect on 4-bit GPTQ models. I get same response as default GPTQ.

Snowad14 commented 1 year ago

Yes, I had the same thing, but I thought it was just me having a problem. So yes, maybe it will work, but you still have to re-quantize LLaMA.

Ph0rk0z commented 1 year ago

So you mean re-quantize the LoRA? Because the llama model itself already loads in 4-bit, but the LoRA is loading in 8.

BadisG commented 1 year ago

What BadisG did to load the lora for inference has no effect on 4-bit GPTQ models. I get same response as default GPTQ.

I believe that we can arrive at the correct conclusion for all of this once we have the opportunity to manipulate the seed, allowing us to conduct thorough testing.

https://github.com/oobabooga/text-generation-webui/issues/463

wywywywy commented 1 year ago

So with @BadisG's fix, the _find_and_replace function in lora.py of peft now does nothing with GPTQ, because GPTQ 4-bit models have no Linear layers (they've been packed into QuantLinear layers).

Are we saying we don't actually need _find_and_replace for inference?
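
One way to check that directly (a hypothetical helper, not something that exists in the repo or in peft) is to count how many modules actually ended up carrying LoRA weights after the adapter is loaded:

from modules import shared  # when run inside the web UI

def count_lora_modules(model):
    # peft's LoRA-wrapped Linear layers carry lora_A / lora_B parameters; untouched layers don't
    return sum(1 for _, module in model.named_modules() if hasattr(module, "lora_A"))

# An 8-bit model should report a non-zero count; a GPTQ model whose projections are
# QuantLinear rather than nn.Linear reports 0, because _find_and_replace matched nothing.
print(count_lora_modules(shared.model))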

Ph0rk0z commented 1 year ago

True, it would help. Right now the behavior for me, when I load the model in 8-bit, is as shown here. The LoRA repairs it, thusly.

The reply is quite different for the exact same deterministic prompt. Instead, LoRA GPTQ replies just like plain GPTQ, and in theory it should not. The only unfair part is that I used the 13b instead of the 7b. I need to put int8_threshold back into models.py and compare the 7b. Maybe try a few prompts.

Are we saying we don't actually need _find_and_replace for inference?

Oh snap.

Arargd commented 1 year ago

What BadisG did to load the lora for inference has no effect on 4-bit GPTQ models. I get same response as default GPTQ.

I believe that we can arrive at the correct conclusion for all of this once we have the opportunity to manipulate the seed, allowing us to conduct thorough testing.

#463

Adding torch.manual_seed(seed) (seed representing your seed; for me it's "0" in this example) just before it calls the model to generate seems to work for me.

It generates the same response over and over; however, for some reason it does generate a different one occasionally. I suspect it's from me pressing the stop button mid-generation?

Nonetheless, even if you get one of the different responses that show up, they are the same ones over and over as well. The same 2-3 different ones. It's enough that I can test BadisG's LoRA fix, which seems to have no effect on the generation at all in my testing, meaning the LoRA isn't properly loaded.

Here's my setup if you want to try and match mine. I'm not sure what all could affect it, so you may not have luck matching with me, but I'll provide some info anyway.

First of all, I added this bit of code around line 202 of text_generation.py. It should be after clear_torch_cache(), which is not a good solution overall, but it's exactly before where model.generate() is called in my setup. (And it's good enough for me to test it.) You may need to find the right line for yours if you're using something else like no-stream, etc. I imagine someone can figure out an actual decent place to put this that will affect all of those! 😓

seed = 0
torch.manual_seed(seed)
print("seed: " + str(seed))

I print my seed so that I know it's being set when I click generate; you can also add whatever random function you like to generate a random seed.

Here are my parameters, which would normally generate random things without my seed: image

Here's what it generates using 7b in 4-bit, just using the default prompt.

image

wywywywy commented 1 year ago

@Arargd Don't we need torch.cuda.manual_seed() as well?

seed = 12
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
print("seed: " + str(seed))
Arargd commented 1 year ago

@Arargd Don't we need torch.cuda.manual_seed() as well?

seed = 12
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
print("seed: " + str(seed))

I suppose, if it makes a difference for people's systems. With or without it, it seems to generate the same for me. Mine's just hackily patched in, so a better overall solution is probably needed in any case.

wywywywy commented 1 year ago

I put the above code in, and now my GPTQ 4-bit outputs the same thing with and without the LoRA. Is my testing method flawed?

Generation using seed 12, 13b weights with alpaca13B-lora.

Prompt

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

### Instruction:
In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017
### Response:

Parameters

do_sample=True
temperature=0.36
top_p=1
typical_p=1
repetition_penalty=1.23
top_k=12
num_beams=1
penalty_alpha=0
min_length=0
length_penalty=1
no_repeat_ngram_size=0
early_stopping=False

Output

Below is an instruction that describes a task. Write a response that appropriately completes the request.

Instruction:

In 10 sentences, summarise the seminal paper called "Attention is All You Need" by Vaswani et al. in 2017

Response::

The authors of this article proposed a new model for attention called Attention Is All You Need (AITAN). The AITAN model has three components: a multi-headed encoder and decoder with self-attentions to capture long range dependencies across input sequences; a softmax classifier which predicts labels from hidden states; and a single linear layer on top of the output of the last head to compute contextualized representations of tokens. They also used the transformer architecture as their base network because it can be trained efficiently using gradient descent methods such as Adam or stochastic gradient decent. In addition, they have shown how to use the Transformer architecture for sequence tagging tasks like named entity recognition and part-of-speech tagging. Finally, the authors have demonstrated that their approach outperforms previous state-of-the art models for these two tasks.

Arargd commented 1 year ago

@BadisG

Huzzah, I managed to get the exact same result with your first input there. The reason your LoRA output changed on the second one is that you added a line to your prompt there: "Mention the word "large language models" in that poem." is omitted from the first one.