turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Completion abruptly stopped - RuntimeError: CUDA error: an illegal memory access was encountered #273

Open Thireus opened 1 year ago

Thireus commented 1 year ago

The following error sometimes occurs while a completion is in progress with large context sizes.

Traceback (most recent call last):
  File "/home/username/text-generation-webui/modules/callbacks.py", line 56, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/home/username/text-generation-webui/modules/text_generation.py", line 321, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/home/username/text-generation-webui/modules/exllama_hf.py", line 87, in __call__
    logits = self.ex_model.forward(torch.tensor([seq[-1:]], dtype=torch.long), ex_cache, lora=self.lora).to(input_ids.device)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 530, in forward
    self.self_attn.fused(hidden_states, cache, buffer, self.input_layernorm, lora)
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 376, in fused
    key_states = cache.key_states[self.index].narrow(2, 0, past_len + q_len).narrow(0, 0, bsz)
RuntimeError: start (0) + length (4097) exceeds dimension size (4096).
Exception in thread Thread-172 (gentask):
Traceback (most recent call last):
  File "/home/username/miniconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Traceback (most recent call last):
  File "/home/username/text-generation-webui/modules/text_generation.py", line 328, in generate_reply_HF
    yield get_reply_from_output_ids(output, input_ids, original_question, state, is_chat=is_chat)
  File "/home/username/text-generation-webui/modules/text_generation.py", line 206, in get_reply_from_output_ids
    if shared.tokenizer.convert_ids_to_tokens(int(output_ids[-new_tokens])).startswith('▁'):
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
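
For context, the first error comes from `narrow()` being asked for a view past the end of the preallocated KV cache. Here is a minimal sketch of that mechanism (not exllama's actual code; the tensor layout and sizes are assumptions for illustration):

```python
import torch

# The key/value cache is preallocated with max_seq_len slots along the
# sequence dimension; narrow() cannot return a view past that allocation.
max_seq_len = 4096
bsz, num_heads, head_dim = 1, 32, 128
key_cache = torch.zeros(bsz, num_heads, max_seq_len, head_dim)

past_len, q_len = 4096, 1  # cache already full, one more token requested
try:
    key_cache.narrow(2, 0, past_len + q_len)
except RuntimeError as e:
    print(e)  # start (0) + length (4097) exceeds dimension size (4096).
```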
turboderp commented 1 year ago

According to the error message, it's attempting to generate at position 4097, which exceeds the 4096-token sequence length you've set. I have to assume this is an issue in text-generation-webui. (?)
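
A workaround on the caller's side would be to clamp the sequence before each forward call so `past_len + q_len` never exceeds the cache. A hypothetical sketch (the helper name is made up, and `current_seq_len` is an assumption about how the cache tracks its fill level):

```python
def clamp_to_context(seq, ex_cache, max_seq_len):
    """Hypothetical guard: drop the oldest tokens when the cache is full."""
    if ex_cache.current_seq_len + 1 > max_seq_len:
        # Keep room for one new token; the truncated prompt then has to be
        # re-ingested so the cache matches the shortened sequence.
        seq = seq[-(max_seq_len - 1):]
        ex_cache.current_seq_len = 0  # force re-ingestion from scratch
    return seq
```

In practice this is what a frontend's truncation setting is meant to do: trim the prompt before generation rather than letting the model run off the end of its context window.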