Without looking at the code, do we maybe need a newer llamacpp version?
The team behind llama.cpp has recently added support for the gemma2 architecture.
I am getting the same error trying to run gemma 2 with Ollama on Mac:
-MBP ~ % ollama run gemma2:27b
pulling manifest
pulling aee340a0e20a... 100% ▕████████████████▏ 15 GB
pulling 109037bec39c... 100% ▕████████████████▏ 136 B
pulling 097a36493f71... 100% ▕████████████████▏ 8.4 KB
pulling 22a838ceb7fb... 100% ▕████████████████▏ 84 B
pulling 209e43b1aaf0... 100% ▕████████████████▏ 488 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Error: llama runner process has terminated: signal: abort trap error:error loading model architecture: unknown model architecture: 'gemma2'
Does anyone have a solution?
Support for Gemma 2 in upstream llama.cpp is still being worked on, as some features are missing. Without them (logit soft-capping and the attention scaling factor), generation quality is degraded. This PR contains fixes for those missing features; see the bugs linked in the PR.
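For anyone wondering what logit soft-capping actually is: it squashes the logits (and, in Gemma 2, the attention scores) with a tanh so that no single value can grow unboundedly, which is why skipping it hurts output quality. A minimal sketch of the idea in Python, using the cap values from the published Gemma 2 config (30.0 for the final logits, 50.0 for attention scores); the function name is illustrative, not llama.cpp's actual code:

```python
import numpy as np

def soft_cap(x: np.ndarray, cap: float) -> np.ndarray:
    """Squash values into (-cap, cap); roughly linear for small inputs."""
    return cap * np.tanh(x / cap)

FINAL_LOGIT_SOFTCAP = 30.0  # applied to the output logits before sampling
ATTN_LOGIT_SOFTCAP = 50.0   # applied to attention scores before softmax

raw_logits = np.array([120.0, 3.0, -5.0])
print(soft_cap(raw_logits, FINAL_LOGIT_SOFTCAP))
# ~[29.98  2.99 -4.95]: extreme values are capped, small ones barely change
```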
Has anyone successfully loaded the GGUF files (the same models as the author) for Gemma 2 (9B and 27B)? I'm having trouble with it.
Also having problems with it.
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma2'
https://github.com/oobabooga/text-generation-webui/issues/6184
I simply had to update Ollama and gemma2:9b worked. I wasn't able to test the 27B since I deleted the file. I will try to download it again and let you know if that works as well.
Same issue here.
@sevaroy LM Studio released a new version (0.2.26) that supports Gemma 2 (9B and 27B), including the GGUF quantized models. To fix your issue, just upgrade to the newest LM Studio version. It worked for me.
I tried the hint above, but the problem was not solved. I'm using a Q6_K GGUF file and get the same error during loading (model loader: llama.cpp). When will the update with the fixes be available?
Update: I downloaded the updates from the dev branch, but it runs on the CPU, not the GPU (llama.dll error).
This should be fixed in dev by https://github.com/oobabooga/text-generation-webui/commit/7e22eaa36c72431dfff78416bb848fadd5701727, which updates llama-cpp-python to 0.2.81 and includes logit soft-capping; that should fix most of the Gemma 2 27B generation quality issues.
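If you are not sure which llama-cpp-python build your environment actually picked up (the traceback below shows the llama_cpp_cuda_tensorcores variant), a quick check along these lines can help; the package names are an assumption based on the wheel variants text-generation-webui installs, so adjust as needed:

```python
import importlib

# Variant names are assumptions; llama_cpp_cuda_tensorcores is the one
# visible in the traceback later in this thread.
for name in ("llama_cpp", "llama_cpp_cuda", "llama_cpp_cuda_tensorcores"):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown version"))
    except ImportError:
        print(name, "not installed")
```

Per the commit above, you want 0.2.81 or newer for the Gemma 2 fixes.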
It still doesn't work for me. Plain llama-cpp-python works fine, so the issue looks like it's in oobabooga/llama-cpp-python-cuBLAS-wheels, which uses differently compiled wheels, IMO.
CPU inference worked fine with that fix on the dev branch. Today I pulled the newest changes and it looked like it could work, since I didn't get any DLL-related errors, but now I see this when trying to load gemma-2-9b-it-Q6_K.gguf:
06:29:00-541936 ERROR Failed to load the model.
Traceback (most recent call last):
File "F:\...\modules\ui_model_menu.py", line 246, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\models.py", line 94, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\models.py", line 275, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\llamacpp_model.py", line 85, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "F:\...\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 415, in __init__
self.scores: npt.NDArray[np.single] = np.ndarray(
^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.81 GiB for an array with shape (8192, 256000) and data type float32
Exception ignored in: <function LlamaCppModel.__del__ at 0x000001B029A43BA0>
Traceback (most recent call last):
File "F:\...\modules\llamacpp_model.py", line 33, in __del__
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
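For what it's worth, the 7.81 GiB in that first failure is exactly the host-side scores buffer visible in the traceback: context length times Gemma 2's 256000-token vocabulary in float32. A quick back-of-the-envelope check:

```python
# Shape (8192, 256000) in float32, as reported by the ArrayMemoryError above.
n_ctx, n_vocab, bytes_per_f32 = 8192, 256_000, 4

size_gib = n_ctx * n_vocab * bytes_per_f32 / 2**30
print(f"{size_gib:.2f} GiB")  # 7.81 GiB, matching the error message
```

Lowering n_ctx shrinks that allocation proportionally, which is presumably why a retry with more free RAM can get through.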
EDIT: never mind, the model loaded on the second try, but now I get a weird error during inference 🤔
F:\...\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py:1054: RuntimeWarning: Detected duplicate leading "<bos>" in prompt, this will likely reduce response quality, consider removing it...
warnings.warn(
Traceback (most recent call last):
File "F:\...\modules\callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\llamacpp_model.py", line 157, in generate
for completion_chunk in completion_chunks:
File "F:\...\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 1132, in _create_completion
for token in self.generate(
File "F:\...\modules\llama_cpp_python_hijack.py", line 113, in my_generate
for output in self.original_generate(*args, **kwargs):
File "F:\...\modules\llama_cpp_python_hijack.py", line 113, in my_generate
for output in self.original_generate(*args, **kwargs):
File "F:\...\modules\llama_cpp_python_hijack.py", line 113, in my_generate
for output in self.original_generate(*args, **kwargs):
[Previous line repeated 2991 more times]
RecursionError: maximum recursion depth exceeded in comparison
Output generated in 0.54 seconds (0.00 tokens/s, 0 tokens, context 126, seed 786294289)
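Looking at the repeated my_generate -> original_generate frames, my guess (an assumption on my part, not verified against the webui code) is that the generate hijack gets applied a second time after a reload, so original_generate ends up pointing at the wrapper itself. An illustrative sketch of how that kind of double-patching produces exactly this traceback:

```python
class Llama:
    """Stand-in for llama_cpp.Llama, reduced to the one method that matters here."""
    def generate(self, prompt):
        yield f"token for {prompt!r}"

def monkey_patch(cls):
    """Save the original method on the class, then replace it with a wrapper."""
    cls.original_generate = cls.generate  # second time: this is already the wrapper!

    def my_generate(self, *args, **kwargs):
        # extra streaming/progress logic would go here
        for output in self.original_generate(*args, **kwargs):
            yield output

    cls.generate = my_generate

monkey_patch(Llama)  # fine: original_generate -> the real generate
monkey_patch(Llama)  # re-applied, e.g. after a reload: original_generate -> my_generate

try:
    list(Llama().generate("hi"))
except RecursionError as exc:
    print(type(exc).__name__)  # RecursionError: the wrapper keeps calling itself
```

If that is the cause, a simple "already patched?" guard before reassigning original_generate would avoid the self-call.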
I am getting the same error now as well since updating, and not just with Gemma: I also get it with Midnight Miqu. The models load just fine.
That would be a different issue. Please have a look here for the fix and see if it works for you too: https://github.com/oobabooga/text-generation-webui/issues/6201#issuecomment-2210361743
Is it fixed in 1.9.1?
I can confirm it is fixed in v1.9.1. Credit to @oobabooga for putting serious time into the fix.
Describe the bug
I believe this is a new architecture from the Gemma 2 model family. Error message:
Env: Colab T4. GGUF: gemma-2-9b-it-GGUF. Tested file: gemma-2-9b-it-Q6_K.gguf
P.S. Previous model worked fine.
Is there an existing issue for this?
Reproduction
Open the colab notebook, then replace variables and execute.
Screenshot
No response
Logs
System Info