Currently, when doing speculative sampling, the draft model's "lookahead" is not bounded by cache_size, resulting in an error if we try to predict past it.
For example, if the model context length is 4096 (and therefore cache_size too) and num_speculative_tokens = 5, then when generation reaches 4092 tokens the draft model tries to predict 5 more and fails on the 4097th:
  File "example.py", line 199, in extract
    chunk, eos, _ = self.generator.stream()
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 143, in stream
    next_token, eos = self._gen_single_token(self.settings)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 313, in _gen_single_token
    token, eos = self._gen_single_token_speculative(gen_settings, prefix_token)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/generator/streaming.py", line 332, in _gen_single_token_speculative
    logits = self.draft_model.forward(draft_sequence_ids[:, -1:], self.draft_cache).float().cpu()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/exllamav2/model.py", line 550, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
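A possible fix would be to clamp the lookahead to the space remaining in the cache before drafting. This is only a minimal sketch, assuming a hypothetical helper (clamped_lookahead is not part of the exllamav2 API); the real fix would live in _gen_single_token_speculative:

```python
def clamped_lookahead(past_len: int, cache_max_seq_len: int,
                      num_speculative_tokens: int) -> int:
    """Bound the draft model's lookahead so that past_len + lookahead
    never exceeds the cache size (hypothetical helper, not exllamav2 API)."""
    remaining = cache_max_seq_len - past_len
    return max(0, min(num_speculative_tokens, remaining))

# With a 4096-token cache and 4092 tokens already generated, only 4
# draft tokens fit instead of the configured 5:
print(clamped_lookahead(4092, 4096, 5))  # -> 4
```

When the clamped value reaches 0, the generator would have to skip drafting entirely for that step (or stop, depending on how the end of the context is handled).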