oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

llamacpp_hf crashes when trying to generate text #4687

Closed Technologicat closed 10 months ago

Technologicat commented 11 months ago

Describe the bug

When the model is loaded using llamacpp_hf, text generation crashes upon pressing Generate in the Chat tab.

The technical cause is that, for some reason, outputs.hidden_states is None when contrastive search tries to index it.
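For context, here is a minimal sketch of the failing access pattern. The CausalLMOutputWithPast construction below is illustrative only (it is not the webui's actual llamacpp_hf code); it just shows that when a forward pass leaves hidden_states unset, the line transformers' contrastive_search executes next raises exactly this TypeError:

```python
# Illustrative sketch only; not the actual llamacpp_hf wrapper code.
import torch
from transformers.modeling_outputs import CausalLMOutputWithPast

# A forward-pass result whose hidden_states field was never populated:
outputs = CausalLMOutputWithPast(logits=torch.zeros(1, 1, 32000), hidden_states=None)

# transformers' contrastive_search() then does roughly this:
last_hidden_states = outputs.hidden_states[-1]  # TypeError: 'NoneType' object is not subscriptable
```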

This would be very nice to get working.

Is there an existing issue for this?

Reproduction

  1. Start text-generation-webui (run ./start_linux.sh, then open your web browser to http://localhost:7860/)
  2. In the Model tab, pick a GGUF model (tested with dolphin-2.1-mistral-7b.Q5_K_M.gguf from here)
  3. Model loader ⊳ pick llamacpp_hf
  4. Press the Load button
  5. Parameters ⊳ Generation ⊳ pick Contrastive search (see the sketch after this list for how this preset maps onto generate() arguments)
  6. Switch to the Chat tab
  7. From the "≡" menu, pick Start new chat
  8. Enter some text, and press Enter
  9. Observe the crash in the terminal.
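For reference, outside the UI the same preset corresponds roughly to the following generate() call in plain transformers (the exact numbers the preset uses may differ; penalty_alpha together with top_k is what routes generation into contrastive_search(), the code path that later indexes outputs.hidden_states):

```python
# Rough sketch of what the Contrastive search preset asks generate() to do.
# Parameter values are illustrative; gpt2 is used only as a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Hello there,", return_tensors="pt")
# penalty_alpha > 0 with top_k > 1 selects the contrastive search decoding path.
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```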

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/home/****/oobabooga_linux/modules/callbacks.py", line 57, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/oobabooga_linux/modules/text_generation.py", line 355, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/****/oobabooga_linux/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/oobabooga_linux/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 1623, in generate
    return self.contrastive_search(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/oobabooga_linux/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/oobabooga_linux/installer_files/env/lib/python3.11/site-packages/transformers/generation/utils.py", line 2016, in contrastive_search
    last_hidden_states = outputs.hidden_states[-1]
                         ~~~~~~~~~~~~~~~~~~~~~^^^^
TypeError: 'NoneType' object is not subscriptable
Output generated in 1.54 seconds (0.00 tokens/s, 0 tokens, context 284, seed 146776359)

System Info

OS: Linux Mint 21.1 Vera
GPU: NVIDIA GeForce RTX 3070 Ti mobile (8 GB)

Extensions: code_syntax_highlight, gallery, ui_tweaks.

I also have superboogav2 installed, but because it closely interacts with the internals of the system, I disabled it to test this. Just to be sure, after disabling the extension and saving settings, I cold-booted text-generation-webui by Ctrl+C'ing the process in the terminal and then running the start script again. (The "Apply flags/extensions and restart" button doesn't always work correctly, but never mind that now; that's a separate issue.)

I have installed a tokenizer for llamacpp_hf using Option 1: download oobabooga/llama-tokenizer under "Download model or LoRA". I understood that's the only step needed?

EDIT: Ah, the system info field of the bug report doesn't take Markdown. Fixed the formatting.

Technologicat commented 11 months ago

Aaaaa!

Contrastive Search (only works for the Transformers loader at the moment). --https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab

Using another preset, the llamacpp_hf loader no longer crashes. Sorry for the noise.

Curiously, though, Contrastive search seems to work just fine with the llama.cpp loader.

So maybe instead of a bug report, this could be changed into a feature request: it would be nice to have a warning in the UI if some presets are known not to be compatible with some loaders. I only discovered this when I decided to systematically read through the latest user manual on the wiki.
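Something as simple as the following would already help (a hypothetical sketch; the table and function name are made up for illustration and are not part of text-generation-webui):

```python
# Hypothetical sketch of the requested warning; not actual webui code.
# Maps a preset name to the loaders it is known not to work with.
KNOWN_INCOMPATIBLE = {
    "Contrastive Search": {"llamacpp_HF"},
}

def preset_warning(preset: str, loader: str) -> str | None:
    """Return a warning string if the preset is known not to work with the loader."""
    if loader in KNOWN_INCOMPATIBLE.get(preset, set()):
        return f"Preset '{preset}' is not expected to work with the {loader} loader."
    return None
```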

EDIT / final note: You'll need the correct tokenizer for the model. The default LLaMa one, as suggested by Option 1, isn't compatible with all models. If your model starts producing gibberish, it could be due to an incompatible tokenizer. Thus, prefer Option 2. Specifically for dolphin-2.1-mistral-7b.Q5_K_M.gguf, you can obtain the tokenizer files from its original unquantized model repo.
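One way to fetch just the tokenizer files is with huggingface_hub. The repo id and destination folder below are assumptions; adjust them to your setup. llamacpp_HF expects the tokenizer files to sit in the same models/ subfolder as the .gguf file:

```python
# Sketch only: repo id and destination path are assumptions, adjust to your setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ehartford/dolphin-2.1-mistral-7b",        # assumed original (unquantized) repo
    allow_patterns=["tokenizer*", "special_tokens_map.json"],
    local_dir="models/dolphin-2.1-mistral-7b",         # hypothetical folder containing the .gguf
)
```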

Technologicat commented 10 months ago

Closing this, since I got it working by using another preset instead of Contrastive search.

Now that min_p sampling is available for the llama.cpp loader, that's preferable anyway.

Nevertheless, the documentation could more clearly state that Contrastive Search is not expected to work with all loaders.