oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Target modules ['q_proj', 'v_proj'] not found in the base model #3655

Closed Fusseldieb closed 1 year ago

Fusseldieb commented 1 year ago

Describe the bug

Whenever I try to train a LoRA model, I get the following error:

Traceback (most recent call last):
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\training.py", line 505, in do_train
    lora_model = get_peft_model(shared.model, config)
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\peft\mapping.py", line 106, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\peft\peft_model.py", line 889, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\peft\peft_model.py", line 111, in __init__
    self.base_model = PEFT_TYPE_TO_MODEL_MAPPING[peft_config.peft_type](
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\peft\tuners\lora.py", line 274, in __init__
    super().__init__(model, config, adapter_name)
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\peft\tuners\tuners_utils.py", line 88, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "C:\Users\vstil\Downloads\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\peft\tuners\tuners_utils.py", line 222, in inject_adapter
    raise ValueError(
ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

I've tried ExLlama_HF and GPTQ-for-LLaMa. Neither works, and both show the exact same error.

I've tried with this model: "https://huggingface.co/TheBloke/Llama-2-7B-32K-Instruct-GPTQ"
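
A quick diagnostic sketch (not part of the webui) that can help confirm whether the loaded model actually exposes the module names PEFT is looking for; here `model` stands for whatever object the loader returned, e.g. `shared.model` in the traceback above:

    # List module names that would satisfy the default LoRA targets.
    # If nothing containing "q_proj"/"v_proj" shows up, LoRA injection
    # fails with the "Target modules not found" error shown above.
    def list_projection_modules(model):
        return [name for name, _ in model.named_modules()
                if name.endswith(("q_proj", "v_proj"))]

    # Example use inside a Python session where `model` is already loaded:
    # print(list_projection_modules(model))

Loaders such as ExLlama_HF wrap the model in their own classes, so the usual Llama module names may simply not be visible to PEFT.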

Is there an existing issue for this?

Reproduction

  1. Load the model
  2. Go into the Training tab and set everything
  3. Press start
  4. When it shows "Reloading model", it spits out the above error

Screenshot

No response

Logs

2023-08-22 21:58:40 WARNING:models\TheBloke_Llama-2-7B-32K-Instruct-GPTQ_gptq-4bit-32g-actorder_True\special_tokens_map.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-08-22 21:58:40 INFO:Loaded the model in 3.28 seconds.

Model reloaded OK, continue with training.
2023-08-22 21:58:40 INFO:Getting model ready...
2023-08-22 21:58:40 INFO:Preparing for training...
2023-08-22 21:58:40 INFO:Creating LoRA model...

System Info

Windows 11 22H2 x64
NVIDIA 2080 8GB VRAM
8GB RAM
oobabooga commented 1 year ago

Use a transformers model with load_in_4bit instead. LoRA training with GPTQ-for-LLaMa is implemented but outdated, and AutoGPTQ integration is not implemented yet

ThereforeGames commented 1 year ago

@oobabooga Is it possible to apply a Lora generated with transformers to a GGML model? I'm getting a similar error when trying to do so. I can apply it to GPTQ though. Thanks.

Fusseldieb commented 1 year ago

Use a transformers model with load_in_4bit instead. LoRA training with GPTQ-for-LLaMa is implemented but outdated, and AutoGPTQ integration is not implemented yet

What would a transformers model be, a GGML one?

From what I've seen, at least on my PC, GGML models are much MUCH slower than their GPTQ counterparts (GGML: 0.5t/sec, GPTQ: 35t/sec), which is what made me go "all-in" on GPTQ models. Or is there a way to make GGML models as fast as GPTQ ones?

Cheers!

oobabooga commented 1 year ago

No, GGML is something else. Transformers models look like this:

https://huggingface.co/lmsys/vicuna-33b-v1.3/tree/main

They have files named pytorch_model-00001-of-00007.bin or similar, and they do not have 4bit, GGML, or GPTQ in their names.
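
For reference, loading such a model with load_in_4bit boils down to something like the following transformers call. This is a sketch only, assuming bitsandbytes is installed; the repo id is just the example linked above:

    # Rough sketch of loading a 16-bit Transformers model in 4-bit (bitsandbytes),
    # which is what the "load_in_4bit" option refers to.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "lmsys/vicuna-33b-v1.3"  # example repo from the comment above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_4bit=True,   # quantize on the fly with bitsandbytes
        device_map="auto",
    )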

Ph0rk0z commented 1 year ago

Is it possible to apply a Lora generated with transformers to a GGML model? I'm getting a similar error when trying to do so. I can apply it to GPTQ though. Thanks.

Be me, want to use 70b ggml loras. Convert the adapters to ggml format. They double in size. Write code to use them through the ui. The moment is finally here. The lora is loading.

"offloading of layers to GPU with lora is only supported for the F16 model type"

hypersniper05 commented 1 year ago

Use a transformers model with load_in_4bit instead. LoRA training with GPTQ-for-LLaMa is implemented but outdated, and AutoGPTQ integration is not implemented yet

@oobabooga There's a bug that had me confused. If monkey_patch is on and you try to train using a Transformers model, you get this message on the console: ModuleNotFoundError: No module named 'monkeypatch'. Perhaps a check could prevent this? Thanks again

tqman commented 1 year ago

@hypersniper05 If you read here, there's more to it than just the monkey patch:

https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mode

You also need to install alpaca_lora_4bit in the repositories folder and install a patched version of GPTQ-for-LLaMa from @sterlind.

However, when I do all that, I still get an error, so either it's broken or there's something else I'm missing.

oobabooga commented 1 year ago

The monkey patch was created for GPTQ-for-LLaMa ages ago. AutoGPTQ training integration is not implemented yet (if someone wants to do it and submit a PR, that would be helpful); once implemented, it will not require the monkey patch.

For now, I recommend downloading a 16-bit transformers model and training it with --load-in-4bit instead of using a GPTQ model.

hypersniper05 commented 1 year ago

@hypersniper05 If you read here there's more to it than just monkeypatch:

https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md#using-loras-in-4-bit-mode

You also need to install alpaca_lora_4bit in repositories and install a patched version of GPTQ-For-Llama from @sterlind

However, when I do all that, I still get an error, so either it's broken or there's something else I'm missing.

@tqman Thanks, I have it installed already from source; it's just that the new update broke it. I am training 16-bit models like @oobabooga stated, but we are limited in how deep we can train the model because it requires more VRAM than GPTQ models.

Ph0rk0z commented 1 year ago

Use the repo directly: https://github.com/johnsmith0031/alpaca_lora_4bit/tree/winglian-setup_pip

Finetune.py is back and simple to set up. Unless you are using a Pascal GPU, it is much better and quicker.

Really, load-in-4bit uses a similar amount of VRAM, but it is slower. There is also https://github.com/OpenAccess-AI-Collective/axolotl

hypersniper05 commented 1 year ago

@oobabooga Looks like AutoGPTQ is now part of Transformers https://huggingface.co/blog/gptq-integration

Ph0rk0z commented 1 year ago

If they did it well, then the code can just work as is, same as for BnB.

oobabooga commented 1 year ago

Okay, that's quite impressive: it just works. Just load a GPTQ model with --loader transformers --auto-devices, or in the UI by selecting the "Transformers" loader and checking this checkbox:

(screenshot: the "auto-devices" checkbox in the Transformers loader settings)

I did that and managed to train a tiny LoRA without errors and without changing anything else. The important bit was to activate "auto-devices", otherwise the model doesn't load.

There you go, GPTQ LoRA training is now implemented lol. That saves me so much trouble.
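
For reference, the underlying transformers call this relies on looks roughly like the following. It is a sketch rather than the webui's exact code; the model id is the one used later in this thread:

    # Sketch of the native GPTQ support in transformers: it detects the
    # quantization metadata in the repo and loads it through optimum/auto-gptq.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7b-Chat-GPTQ",
        device_map="auto",   # corresponds to the "auto-devices" option mentioned above
    )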

hypersniper05 commented 1 year ago

I tried just now and I get this message: ImportError: Loading GPTQ quantized model requires optimum library : pip install optimum and auto-gptq library 'pip install auto-gptq'. I am going to install optimum; the blog does mention this requirement: https://huggingface.co/blog/gptq-integration#native-support-of-gptq-models-in-%F0%9F%A4%97-transformers

hypersniper05 commented 1 year ago

@oobabooga There's a note about exllama kernels when fine-tuning: "Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are finetuning a quantized model with peft."

link: https://huggingface.co/docs/transformers/main/en/main_classes/quantization#exllama-kernels-for-faster-inference

Maybe an option to disable it, like for AutoGPTQ, but in the Transformers loader UI?

oobabooga commented 1 year ago

Strangely though, it works for some models but not others. I tried Karen_theEditor-13B-4bit-128g-GPTQ by @FartyPants and got this error:

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models/FPHam_Karen_theEditor-13B-4bit-128g-GPTQ.

This was the command:

python server.py --model models/FPHam_Karen_theEditor-13B-4bit-128g-GPTQ/ --auto-devices --loader transformers --listen

For Llama-2-7b-Chat-GPTQ it works:

python server.py --model models/TheBloke_Llama-2-7b-Chat-GPTQ --auto-devices --loader transformers --listen

@hypersniper05 yes, optimum is required. I have added it to the requirements: https://github.com/oobabooga/text-generation-webui/commit/0576691538da4f93d50b9ee28c252eb91c4f0da3

hypersniper05 commented 1 year ago

Strangely though, it works for some models but not others. I tried Karen_theEditor-13B-4bit-128g-GPTQ by @FartyPants and got this error:

OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models/FPHam_Karen_theEditor-13B-4bit-128g-GPTQ.

This was the command:

python server.py --model models/FPHam_Karen_theEditor-13B-4bit-128g-GPTQ/ --auto-devices --loader transformers --listen

For Llama-2-7b-Chat-GPTQ it works:

python server.py --model models/TheBloke_Llama-2-7b-Chat-GPTQ --auto-devices --loader transformers --listen

@hypersniper05 yes, optimum is required. I have added it to the requirements: 0576691

@oobabooga That's because the quantization_config.json has to be present in the model folder; see this link: https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-quantized-model-from-the-hub

Here are the files for the Llama 2 7B link you posted above; it has it, but the Karen model does not.

(screenshot: file listing of the Llama 2 7B GPTQ repo showing the config files)

oobabooga commented 1 year ago

Karen does have a quantize_config.json though: https://huggingface.co/FPHam/Karen_theEditor-13B-4bit-128g-GPTQ/blob/main/quantize_config.json

The one in TheBloke: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ/blob/main/quantize_config.json

There were two missing keys, namely:

    "model_name_or_path": null,
    "model_file_base_name": "model"

I have tried manually adding them and it didn't work.
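
One possible explanation, as a guess rather than something verified here: the native transformers GPTQ path looks for a quantization_config block embedded in config.json, which newer TheBloke uploads have and older repos may lack, so a standalone quantize_config.json alone might not be picked up. A quick check:

    # Quick check for an embedded quantization_config block in config.json
    # (folder names are the local model directories mentioned in this thread).
    import json
    from pathlib import Path

    def has_embedded_quant_config(model_dir):
        cfg = json.loads((Path(model_dir) / "config.json").read_text())
        return "quantization_config" in cfg

    print(has_embedded_quant_config("models/TheBloke_Llama-2-7b-Chat-GPTQ"))
    print(has_embedded_quant_config("models/FPHam_Karen_theEditor-13B-4bit-128g-GPTQ"))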

hypersniper05 commented 1 year ago

@oobabooga I'll look into it. I added a pull request that adds disable_exllama to the GPTQConfig args: #3772

I am going to help you on the training side. Are you open to having a Tools tab? I was thinking of adding a feature to quantize 16-bit models using the transformers library. Maybe also a feature to merge LoRA weights and one to push to Hugging Face.
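
The "merge LoRA weights" part could be done with PEFT along these lines; this is a sketch with placeholder paths, assuming a 16-bit base model:

    # Sketch: merge a trained LoRA adapter back into its base model with PEFT.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("path/to/16bit-base-model")
    merged = PeftModel.from_pretrained(base, "path/to/lora-adapter").merge_and_unload()
    merged.save_pretrained("path/to/merged-model")  # could then be pushed to the Hub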

Edit: By the way, I am training right now using a GPTQ model (TheBloke/WizardCoder-Python-34B-V1.0-GPTQ) and it works great. Let's see what happens when I add the LoRA.

ThereforeGames commented 1 year ago

Doesn't seem to be working for me. I get the following error after about 15 steps of training:

  File "T:\code\python\oobabooga\installer_files\env\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 406, in update
    found_inf_combined = found_infs[0]
IndexError: list index out of range
hypersniper05 commented 1 year ago

Doesn't seem to be working for me. I get the following error after about 15 steps of training:

  File "T:\code\python\oobabooga\installer_files\env\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 406, in update
    found_inf_combined = found_infs[0]
IndexError: list index out of range

Same here; it happens as soon as the console prints the loss. I'll take a look tonight and see if I can fix it.

Error

2023-08-31 16:39:05 INFO:Loading TheBloke_WizardCoder-Python-34B-V1.0-GPTQ...
2023-08-31 16:39:52 WARNING:models\TheBloke_WizardCoder-Python-34B-V1.0-GPTQ\tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-08-31 16:39:52 WARNING:models\TheBloke_WizardCoder-Python-34B-V1.0-GPTQ\special_tokens_map.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-08-31 16:39:52 INFO:Loaded the model in 47.04 seconds.

Model reloaded OK, continue with training.
2023-08-31 16:39:52 INFO:Getting model ready...
2023-08-31 16:39:52 INFO:Preparing for training...
2023-08-31 16:39:52 INFO:Creating LoRA model...
2023-08-31 16:39:52 INFO:Starting training...
Training 'llama' model using (q, v) projections
Trainable params: 39,321,600 (6.9667 %), All params: 564,420,608 (Model: 525,099,008)
2023-08-31 16:39:52 INFO:Log file 'train_dataset_sample.json' created in the 'logs' directory.
Exception in thread Thread-19 (threaded_run):
Traceback (most recent call last):
  File "D:\AI\oobabooga_windows\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "D:\AI\oobabooga_windows\installer_files\env\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\AI\oobabooga_windows\text-generation-webui\modules\training.py", line 655, in threaded_run
    trainer.train()
  File "D:\AI\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer.py", line 1555, in train
    return inner_training_loop(
  File "D:\AI\oobabooga_windows\installer_files\env\lib\site-packages\transformers\trainer.py", line 1916, in _inner_training_loop
    self.optimizer.step()
  File "D:\AI\oobabooga_windows\installer_files\env\lib\site-packages\accelerate\optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "D:\AI\oobabooga_windows\installer_files\env\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
2023-08-31 16:48:17 INFO:Training complete, saving...
2023-08-31 16:48:17 INFO:Training complete!
Ph0rk0z commented 1 year ago

Well.. windows :(

tqman commented 1 year ago

@hypersniper05 Any luck on resolving that? I'm also consistently getting the "AssertionError: No inf checks were recorded for this optimizer" issue, even when using https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ as linked above by @oobabooga

onexixi commented 1 year ago

It started working again when I upgraded the version: pip3 install numpy --pre torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118. You need torch-2.2.0.dev20230916+cu118-cp310-cp310-win_amd64.whl; use the latest version. @tqman @Fusseldieb @hypersniper05 @Ph0rk0z @Vinno97

mrroll commented 1 year ago

Has anyone successfully trained TheBloke/falcon-40b-instruct-GPTQ?

  1. Inference works when loading it via AutoGPTQ, even though it errors out with ERROR:Failed to load the model. But any attempt to train ends up with an error similar to the original issue:

ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.

  2. Loading it via Transformers, as mentioned in this comment by @oobabooga, appears to load the model, with the same ERROR:Failed to load the model. error.

However, inference no longer works as it did when loaded via AutoGPTQ. I get the following error:

  File "/home/mrroll/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-40B-instruct-GPTQ/modelling_RW.py", line 290, in forward
    attn_output = F.scaled_dot_product_attention(
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.

In addition, attempting to train yields the same error as when loaded with AutoGPTQ:

ValueError: Target modules ['q_proj', 'v_proj'] not found in the base model. Please check the target modules and try again.
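
One hedged note on Falcon specifically: its attention block uses a single fused query_key_value projection rather than separate q_proj/v_proj modules, so even when the model loads, the default Llama-style targets will not match. If training were attempted directly through PEFT, the LoRA config would presumably need something like the following (a sketch, not tested against this model):

    # Sketch of a PEFT LoRA config targeting Falcon's fused attention projection.
    from peft import LoraConfig

    config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["query_key_value"],  # Falcon naming, instead of q_proj/v_proj
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )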

oobabooga commented 1 year ago

I have added the disable_exllama option for Transformers here: https://github.com/oobabooga/text-generation-webui/commit/36c38d756137ee335b37430c4a14eada1e7ece2e

LoRA training with GPTQ models should work now. Make sure to load the model using the Transformers loader with both auto-devices and disable_exllama checked.
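
In terms of the underlying transformers API, disabling the exllama kernels corresponds to something like the following. This is a sketch; in the webui you would just tick the disable_exllama checkbox, and the exact config field may differ between transformers versions:

    # Sketch: load a GPTQ model via transformers with the exllama kernels disabled,
    # as recommended when fine-tuning a quantized model with PEFT.
    from transformers import AutoModelForCausalLM, GPTQConfig

    quant_config = GPTQConfig(bits=4, disable_exllama=True)
    model = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7b-Chat-GPTQ",
        device_map="auto",
        quantization_config=quant_config,
    )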

mrroll commented 1 year ago

I have added the disable_exllama option for Transformers here: 36c38d7

LoRA training with GPTQ models should work now. Make sure to load the model using the Transformers loader with both auto-devices and disable_exllama checked.

Thanks for this! When trying to do inference on the most up-to-date version of the repository, I am still getting this error:

RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.

I'm using TheBloke/falcon-40b-instruct-GPTQ, specifically.

Any chance you or anyone else has ideas on how to get this going?

VegaStarlake commented 1 year ago

I have added the disable_exllama option for Transformers here: 36c38d7

LoRA training with GPTQ models should work now. Make sure to load the model using the Transformers loader with both auto-devices and disable_exllama checked.

I got LoRA training to start for the first time by using what onexixi linked. After updating and doing the above, the LoRA training now returns the following (shortened) error:

Traceback (most recent call last):
  File "E:\one-click-installers-main\text-generation-webui\modules\training.py", line 510, in do_train
    set_peft_model_state_dict(lora_model, state_dict_peft)
  File "E:\one-click-installers-main\installer_files\env\lib\site-packages\peft\utils\save_and_load.py", line 135, in set_peft_model_state_dict
    load_result = model.load_state_dict(peft_model_state_dict, strict=False)
  File "E:\one-click-installers-main\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(

…

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([8, 4096]) from checkpoint, the shape in current model is torch.Size([64, 4096]).

    size mismatch for base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight: copying a param with shape torch.Size([4096, 8]) from checkpoint, the shape in current model is torch.Size([4096, 64]).

It started working again when I upgraded the version: pip3 install numpy --pre torch torchvision torchaudio --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118. You need torch-2.2.0.dev20230916+cu118-cp310-cp310-win_amd64.whl; use the latest version.

This might have worked for me. I was getting the LoRA to begin training, but I don't know if it ever fully completed. I ran into an error that "completed" the LoRA every time it tried to run an eval. None of the LoRAs seem to do anything, though. I tried one other LoRA-specific UI from GitHub, and that never worked.

Should I make a new issue for this? Anyone else have these issues/know any fixes?

Edit: After following onexixi's suggestion again, the LoRA training now begins again after updating, with the exllama kernel disabled and auto-devices turned on. I don't know if it's going to actually complete or not. I also noticed that 4096 is the value I had set for the max tokens / context length with a SuperHOT model, since the full 8k didn't fit on my GPU. I did not change the context for the model I was trying to train (the model ooba suggested above), and I don't even see those options when loading with Transformers, so I don't know how or why that value is there. 64 was the value I had set for the LoRA rank on that attempt. These numbers could be unrelated, but it seems like more than a coincidence. I still have issues, and I don't know if any of the issues I mentioned have been solved as of this edit (29-9).

hypersniper05 commented 1 year ago

@VegaStarlake your lora rank is too high

VegaStarlake commented 1 year ago

@hypersniper05 Could you explain or provide a source? I haven't seen any documentation about limits on the LoRA rank other than the mention that it can increase VRAM usage, but my error doesn't seem to be a memory issue (I could be wrong, though). More importantly, when I force-reinstalled CUDA, the training started with all the training settings exactly the same as they were in the first error, i.e. doing what onexixi linked "fixed" that error with no changes to LoRA rank.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

alexdlhh commented 9 months ago

I am training with TheBloke's Llama 2 7B, using the following model load configuration (screenshot) and these settings on the Training tab (screenshot).

My machine has these specs:

Samsung EVO 2TB M.2 SSD, Intel i5-12400, NVIDIA 4060 Ti 16GB, 16GB DDR4 RAM

All the other model loaders I have tried have failed with the same error as in the main thread. I have also only succeeded with GPTQ; with GGUF I have not been able to. Does anyone know if it is even possible?

Ph0rk0z commented 9 months ago

GGUF has no support for training in this UI. You can do it through native llama.cpp.