turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Draft model error when predicting at max context #274

Closed: ivsanro1 closed this issue 3 weeks ago

ivsanro1 commented 5 months ago

Currently, when doing speculative sampling, the draft model's lookahead is not bounded by cache_size, which results in an error if it tries to predict past it.

For example, if the model context length is 4096 (and therefore cache_size as well), and num_speculative_tokens = 5, then when the generation reaches 4092 tokens the draft model tries to predict 5 more, and the 4097th token triggers the error:

  File "example.py", line 199, in extract
    chunk, eos, _ = self.generator.stream()
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 143, in stream
    next_token, eos = self._gen_single_token(self.settings)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 313, in _gen_single_token
    token, eos = self._gen_single_token_speculative(gen_settings, prefix_token)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 332, in _gen_single_token_speculative
    logits = self.draft_model.forward(draft_sequence_ids[:, -1:], self.draft_cache).float().cpu()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/model.py", line 550, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
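
As a sketch of the kind of bound that would avoid the assertion (the helper below is illustrative only, not exllamav2 API and not the library's actual fix), the draft lookahead can be clamped to the space remaining in the cache:

  # Illustrative helper, not part of exllamav2: clamp the draft lookahead so
  # that past_len + q_len never exceeds cache.max_seq_len in model.forward.
  def clamp_lookahead(past_len: int, cache_max_seq_len: int, num_speculative_tokens: int) -> int:
      remaining = cache_max_seq_len - past_len
      return max(0, min(num_speculative_tokens, remaining))

  # With the numbers from this report: 4096-token cache, 4092 tokens already
  # generated, 5 speculative tokens requested -> only 4 draft steps fit.
  assert clamp_lookahead(4092, 4096, 5) == 4

With a bound like this, the draft pass near the end of the context simply speculates fewer tokens (or none at all) instead of tripping the assertion.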
turboderp commented 3 weeks ago

This should have been fixed for a while now, and in any case the new generator has a new draft implementation.