oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

exllama_hf throws error when num_beams > 1 #3061

Closed: tensiondriven closed this issue 11 months ago

tensiondriven commented 1 year ago

Describe the bug

While working with exllama_hf, I discovered that passing num_beams > 1 during inference results in an exception (see below).

Additional notes:

This was discovered while diagnosing https://github.com/oobabooga/text-generation-webui/issues/3028

Reproduction

1) Run text-generation-webui:

python server.py --listen --listen-port $port \
    --loader exllama_hf \
    --gpu-split 16,24 \
    --model VicUnlocked-alpaca-65b-4bit \
    --api \
    --verbose

2) Perform inference via the API or the UI with num_beams: 2 (or any value greater than 1).
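For reference, a minimal sketch of such an API call, assuming the legacy blocking API extension at /api/v1/generate (the endpoint, port, and prompt here are illustrative; the parameter names mirror the UI fields):

```python
import requests

# Illustrative endpoint; adjust host/port to your --api / --listen-port setup.
URL = "http://localhost:5000/api/v1/generate"

payload = {
    "prompt": "Once upon a time",
    "max_new_tokens": 200,
    "num_beams": 2,  # any value > 1 triggers the exception below
    "do_sample": True,
}

response = requests.post(URL, json=payload)
print(response.json())
```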

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/home/j/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/home/j/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/j/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/j/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1665, in generate
    return self.beam_sample(
  File "/home/j/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 3278, in beam_sample
    next_token_scores_processed = logits_processor(input_ids, next_token_scores)
  File "/home/j/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 92, in __call__
    scores = processor(input_ids, scores)
  File "/home/j/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 203, in __call__
    score = torch.gather(scores, 1, input_ids)
RuntimeError: Size does not match at dimension 0 expected index [2, 162] to be smaller than self [1, 32000] apart from dimension 1

Note: the index shape [2, 162] seems to depend on the context length; the first dimension appears to match num_beams and the second the current sequence length.
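For context, beam sampling expands input_ids to batch_size * num_beams rows before calling the logits processors, while the exllama_hf wrapper appears to return scores for only a single row, so torch.gather sees mismatched batch dimensions. A minimal sketch that reproduces the error in isolation, using the shapes from the traceback above:

```python
import torch

# Shapes from the traceback: beam sampling expanded input_ids to
# (num_beams, seq_len) = (2, 162), but the model returned scores for
# a single sequence, (1, vocab_size) = (1, 32000).
scores = torch.randn(1, 32000)
input_ids = torch.randint(0, 32000, (2, 162))

# torch.gather requires every non-gather dimension of the index to be
# no larger than that of the input, so this raises:
# RuntimeError: Size does not match at dimension 0 ...
score = torch.gather(scores, 1, input_ids)
```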


System Info

```shell
Ubuntu 22.04 headless, Nvidia 3090
```

oobabooga commented 1 year ago

exllama_hf is not complete. Some things don't work, like contrastive search or perplexity evaluation. Possibly no_repeat_ngram is one of those things.

tensiondriven commented 1 year ago

no_repeat_ngrams works; num_beams does not.

(I conflated these because I was using both in my testing.)

Ergonomics-wise, if these limitations are known, it would make a lot of sense to be defensive and explicitly raise an error when an unsupported feature is requested, since it will result in an exception anyway.
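A hypothetical sketch of such a guard (the function and dict names are illustrative, not the actual webui internals):

```python
# Illustrative only; not the real module layout of text-generation-webui.
UNSUPPORTED_IN_EXLLAMA_HF = {"num_beams": 1}  # parameter -> only supported value

def check_generation_kwargs(generate_kwargs: dict) -> None:
    """Fail fast with a clear message instead of crashing mid-generation."""
    for param, supported in UNSUPPORTED_IN_EXLLAMA_HF.items():
        value = generate_kwargs.get(param, supported)
        if value != supported:
            raise ValueError(
                f"exllama_hf does not support {param}={value}; "
                f"only {param}={supported} is currently supported."
            )
```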

practical-dreamer commented 1 year ago

Does exllama (non-hf) work with num_beams?

[edit] It doesn't seem to crash like exllama_hf, but I don't think it's actually working: VRAM usage isn't spiking, and I'm not seeing it alter its own text during generation the way beam search normally does.

practical-dreamer commented 1 year ago

Just wanted to mention, for those with VRAM to spare, that running with bnb 4-bit is a viable alternative when num_beams is needed.
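A hedged sketch of such an invocation, mirroring the reproduction command above (--load-in-4bit is the webui's bitsandbytes flag; the model path is illustrative, since bnb loads unquantized weights rather than a GPTQ checkpoint):

```shell
python server.py --listen --listen-port $port \
    --loader transformers \
    --load-in-4bit \
    --model <unquantized-model> \
    --api --verbose
```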

github-actions[bot] commented 11 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.