Without looking at the code, do we maybe need a newer llamacpp version?
The team behind llama.cpp has recently added support for the gemma2 architecture.
I am getting the same error trying to run gemma 2 with Ollama on Mac:
-MBP ~ % ollama run gemma2:27b
pulling manifest
pulling aee340a0e20a... 100% ▕████████████████▏ 15 GB
pulling 109037bec39c... 100% ▕████████████████▏ 136 B
pulling 097a36493f71... 100% ▕████████████████▏ 8.4 KB
pulling 22a838ceb7fb... 100% ▕████████████████▏ 84 B
pulling 209e43b1aaf0... 100% ▕████████████████▏ 488 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Error: llama runner process has terminated: signal: abort trap error:error loading model architecture: unknown model architecture: 'gemma2'
Does anyone have a solution?
Support for Gemma 2 in upstream llama.cpp is still being worked on, as some features are missing. Without them (logit soft-capping and the attention scaling factor), generation quality is degraded. This PR contains fixes for those missing features; see the bugs linked in the PR.
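For anyone wondering what logit soft-capping actually is: it squashes the logits (and, in Gemma 2, the attention scores) with a tanh so that no single value can grow unboundedly, which is why skipping it hurts output quality. A minimal sketch of the idea in Python, using the cap values from the published Gemma 2 config (30.0 for the final logits, 50.0 for attention scores); the function name is illustrative, not llama.cpp's actual code:

```python
import numpy as np

def soft_cap(x: np.ndarray, cap: float) -> np.ndarray:
    """Squash values into (-cap, cap); roughly linear for small inputs."""
    return cap * np.tanh(x / cap)

FINAL_LOGIT_SOFTCAP = 30.0  # applied to the output logits before sampling
ATTN_LOGIT_SOFTCAP = 50.0   # applied to attention scores before softmax

raw_logits = np.array([120.0, 3.0, -5.0])
print(soft_cap(raw_logits, FINAL_LOGIT_SOFTCAP))
# ~[29.98  2.99 -4.95]: extreme values are capped, small ones barely change
```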
Has anyone successfully loaded the GGUF files (the same models as the author) for Gemma 2 (9B and 27B)? I'm having trouble with it.
Also having problems with it.
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma2'
https://github.com/oobabooga/text-generation-webui/issues/6184
I simply had to update Ollama and gemma2:9b worked. I wasn't able to test the 27B since I deleted the file. I will try to download it again and let you know if that works as well.
Same issue here.
@sevaroy LM Studio released a new version (0.2.26) that supports Gemma 2 (9B and 27B), including the GGUF quantized models. To fix your issue, just upgrade to the newest LM Studio version. It worked for me.
I tried the hint above, but the problem was not solved. I'm using a Q6_K GGUF file and get the same error during loading (model loader: llama.cpp). When will the update with the fixes be available?
Update: I downloaded the updates from the dev branch, but it runs on the CPU, not the GPU (llama.dll error).
This should be fixed in dev by https://github.com/oobabooga/text-generation-webui/commit/7e22eaa36c72431dfff78416bb848fadd5701727, which updates llama-cpp-python to 0.2.81 and includes logit soft-capping; that should fix most of the Gemma 2 27B generation quality issues.
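If you are not sure which llama-cpp-python build your environment actually picked up (the traceback below shows the llama_cpp_cuda_tensorcores variant), a quick check along these lines can help; the package names are an assumption based on the wheel variants text-generation-webui installs, so adjust as needed:

```python
import importlib

# Variant names are assumptions; llama_cpp_cuda_tensorcores is the one
# visible in the traceback later in this thread.
for name in ("llama_cpp", "llama_cpp_cuda", "llama_cpp_cuda_tensorcores"):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown version"))
    except ImportError:
        print(name, "not installed")
```

Per the commit above, you want 0.2.81 or newer for the Gemma 2 fixes.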
It still doesn't work for me. Plain llama-cpp-python works fine, so the issue looks like it's in oobabooga/llama-cpp-python-cuBLAS-wheels, which uses differently compiled wheels, IMO.
CPU inference worked fine with that fix on the dev branch. Today I pulled the newest changes and it looked like it could work, since I didn't get any DLL-related errors, but now I see this when trying to load gemma-2-9b-it-Q6_K.gguf:
06:29:00-541936 ERROR Failed to load the model.
Traceback (most recent call last):
File "F:\...\modules\ui_model_menu.py", line 246, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\models.py", line 94, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\models.py", line 275, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\llamacpp_model.py", line 85, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "F:\...\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 415, in __init__
self.scores: npt.NDArray[np.single] = np.ndarray(
^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.81 GiB for an array with shape (8192, 256000) and data type float32
Exception ignored in: <function LlamaCppModel.__del__ at 0x000001B029A43BA0>
Traceback (most recent call last):
File "F:\...\modules\llamacpp_model.py", line 33, in __del__
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
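For what it's worth, the 7.81 GiB in that first failure is exactly the host-side scores buffer visible in the traceback: context length times Gemma 2's 256000-token vocabulary in float32. A quick back-of-the-envelope check:

```python
# Shape (8192, 256000) in float32, as reported by the ArrayMemoryError above.
n_ctx, n_vocab, bytes_per_f32 = 8192, 256_000, 4

size_gib = n_ctx * n_vocab * bytes_per_f32 / 2**30
print(f"{size_gib:.2f} GiB")  # 7.81 GiB, matching the error message
```

Lowering n_ctx shrinks that allocation proportionally, which is presumably why a retry with more free RAM can get through.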
EDIT: never mind, the model loaded on the second try, but now I get a weird error during inference 🤔
F:\...\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py:1054: RuntimeWarning: Detected duplicate leading "<bos>" in prompt, this will likely reduce response quality, consider removing it...
warnings.warn(
Traceback (most recent call last):
File "F:\...\modules\callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\...\modules\llamacpp_model.py", line 157, in generate
for completion_chunk in completion_chunks:
File "F:\...\installer_files\env\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 1132, in _create_completion
for token in self.generate(
File "F:\...\modules\llama_cpp_python_hijack.py", line 113, in my_generate
for output in self.original_generate(*args, **kwargs):
File "F:\...\modules\llama_cpp_python_hijack.py", line 113, in my_generate
for output in self.original_generate(*args, **kwargs):
File "F:\...\modules\llama_cpp_python_hijack.py", line 113, in my_generate
for output in self.original_generate(*args, **kwargs):
[Previous line repeated 2991 more times]
RecursionError: maximum recursion depth exceeded in comparison
Output generated in 0.54 seconds (0.00 tokens/s, 0 tokens, context 126, seed 786294289)
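Looking at the repeated my_generate -> original_generate frames, my guess (an assumption on my part, not verified against the webui code) is that the generate hijack gets applied a second time after a reload, so original_generate ends up pointing at the wrapper itself. An illustrative sketch of how that kind of double-patching produces exactly this traceback:

```python
class Llama:
    """Stand-in for llama_cpp.Llama, reduced to the one method that matters here."""
    def generate(self, prompt):
        yield f"token for {prompt!r}"

def monkey_patch(cls):
    """Save the original method on the class, then replace it with a wrapper."""
    cls.original_generate = cls.generate  # second time: this is already the wrapper!

    def my_generate(self, *args, **kwargs):
        # extra streaming/progress logic would go here
        for output in self.original_generate(*args, **kwargs):
            yield output

    cls.generate = my_generate

monkey_patch(Llama)  # fine: original_generate -> the real generate
monkey_patch(Llama)  # re-applied, e.g. after a reload: original_generate -> my_generate

try:
    list(Llama().generate("hi"))
except RecursionError as exc:
    print(type(exc).__name__)  # RecursionError: the wrapper keeps calling itself
```

If that is the cause, a simple "already patched?" guard before reassigning original_generate would avoid the self-call.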
I am getting the same error now as well since updating, and not just with Gemma: I also get it with Midnight Miqu. The models load just fine.
That would be a different issue. Please have a look here for the fix and see if it works for you too: https://github.com/oobabooga/text-generation-webui/issues/6201#issuecomment-2210361743
Is it fixed in 1.9.1?
I can confirm it is fixed in v1.9.1. Credit to @oobabooga for putting serious time into the fix.
Describe the bug
I believe this is a new architecture from the Gemma 2 model family. Error message:
Env: Colab T4. GGUF: gemma-2-9b-it-GGUF. Tested file: gemma-2-9b-it-Q6_K.gguf
P.S. Previous model worked fine.
Is there an existing issue for this?
Reproduction
Open the colab notebook, then replace variables and execute.
Screenshot
No response
Logs
System Info