The following sometimes happens while a completion is in progress with large context sizes:
My context size was: 3,262
max_new_tokens was set to: 4,096
Traceback (most recent call last):
  File "/home/username/text-generation-webui/modules/callbacks.py", line 56, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/home/username/text-generation-webui/modules/text_generation.py", line 321, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/home/username/text-generation-webui/modules/exllama_hf.py", line 87, in __call__
    logits = self.ex_model.forward(torch.tensor([seq[-1:]], dtype=torch.long), ex_cache, lora=self.lora).to(input_ids.device)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 530, in forward
    self.self_attn.fused(hidden_states, cache, buffer, self.input_layernorm, lora)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 376, in fused
    key_states = cache.key_states[self.index].narrow(2, 0, past_len + q_len).narrow(0, 0, bsz)
RuntimeError: start (0) + length (4097) exceeds dimension size (4096).
Exception in thread Thread-172 (gentask):
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Traceback (most recent call last):
  File "/home/username/text-generation-webui/modules/text_generation.py", line 328, in generate_reply_HF
    yield get_reply_from_output_ids(output, input_ids, original_question, state, is_chat=is_chat)
  File "/home/username/text-generation-webui/modules/text_generation.py", line 206, in get_reply_from_output_ids
    if shared.tokenizer.convert_ids_to_tokens(int(output_ids[-new_tokens])).startswith('▁'):
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
According to the error message, generation is attempting to write token position 4,097, which exceeds the model's 4,096-token maximum sequence length. The arithmetic checks out: with a 3,262-token prompt, only 4,096 - 3,262 = 834 new tokens fit in the cache, but max_new_tokens was 4,096, so nothing stops generation from running off the end. I have to assume this is an issue in text-generation-webui (?), since it apparently doesn't cap max_new_tokens at the remaining context.
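For what it's worth, the failing narrow() call is easy to reproduce in isolation, and a defensive clamp on max_new_tokens would avoid it. The sketch below only illustrates the idea and is not text-generation-webui's actual code; the cache shape, clamp_max_new_tokens, and its parameter names are my own assumptions:

import torch

# The first traceback's failure mode: exllama's KV cache holds exactly
# max_seq_len (4096) positions along dim 2, so narrowing it to
# past_len + q_len = 4097 positions overflows the tensor.
cache = torch.zeros(1, 32, 4096, 128)  # (bsz, heads, max_seq_len, head_dim); shape assumed
# cache.narrow(2, 0, 4097)  # RuntimeError: start (0) + length (4097) exceeds dimension size (4096)

# Hypothetical guard the caller could apply before generate():
def clamp_max_new_tokens(max_new_tokens, prompt_len, max_seq_len=4096):
    # Only max_seq_len - prompt_len new positions fit after the prompt.
    return max(0, min(max_new_tokens, max_seq_len - prompt_len))

print(clamp_max_new_tokens(4096, 3262))  # -> 834

With the numbers from this report, capping generation at 834 new tokens would stop cleanly at the cache boundary instead of triggering the illegal memory access.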