turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Draft model error when predicting at max context #274

Closed: ivsanro1 closed this issue 3 weeks ago

ivsanro1 commented 5 months ago

Currently, when doing speculative sampling, the draft model's lookahead is not bounded by cache_size, which results in an error if it tries to predict past it.

For example, if the model context length is 4096 (and therefore cache_size as well), and num_speculative_tokens = 5, then when the generation reaches 4092 tokens the draft model tries to predict 5 more, and the 4097th token triggers the error:

  File "example.py", line 199, in extract
    chunk, eos, _ = self.generator.stream()
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 143, in stream
    next_token, eos = self._gen_single_token(self.settings)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 313, in _gen_single_token
    token, eos = self._gen_single_token_speculative(gen_settings, prefix_token)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 332, in _gen_single_token_speculative
    logits = self.draft_model.forward(draft_sequence_ids[:, -1:], self.draft_cache).float().cpu()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/model.py", line 550, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
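
As a sketch of the kind of bound that would avoid the assertion (the helper below is illustrative only, not exllamav2 API and not the library's actual fix), the draft lookahead can be clamped to the space remaining in the cache:

  # Illustrative helper, not part of exllamav2: clamp the draft lookahead so
  # that past_len + q_len never exceeds cache.max_seq_len in model.forward.
  def clamp_lookahead(past_len: int, cache_max_seq_len: int, num_speculative_tokens: int) -> int:
      remaining = cache_max_seq_len - past_len
      return max(0, min(num_speculative_tokens, remaining))

  # With the numbers from this report: 4096-token cache, 4092 tokens already
  # generated, 5 speculative tokens requested -> only 4 draft steps fit.
  assert clamp_lookahead(4092, 4096, 5) == 4

With a bound like this, the draft pass near the end of the context simply speculates fewer tokens (or none at all) instead of tripping the assertion.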
turboderp commented 3 weeks ago

This should have been fixed for a while now, and in any case the new generator has a new draft implementation.