turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

CodeLLaMA + LoRA: RuntimeError: CUDA error: an illegal memory access was encountered #290

Open juanps90 opened 9 months ago

juanps90 commented 9 months ago

I am getting this error when trying to run inference with CodeLLaMA-34B from The-Bloke plus a LoRA trained on the same base model using alpaca_lora_4bit.
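
For reference, a minimal sketch of the setup (assumed to follow exllama's example_lora.py; paths are placeholders):

```python
# Rough sketch of the script: GPTQ base model + LoRA adapter, then simple generation.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
from lora import ExLlamaLora

model_dir = "/models/CodeLlama-34B-GPTQ"      # placeholder: TheBloke GPTQ weights
lora_dir = "/loras/codellama-34b-alpaca"      # placeholder: LoRA from alpaca_lora_4bit

config = ExLlamaConfig(model_dir + "/config.json")
config.model_path = model_dir + "/model.safetensors"   # whatever the .safetensors file is called

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(model_dir + "/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

# Attach the LoRA trained on the same base model
lora = ExLlamaLora(model, lora_dir + "/adapter_config.json", lora_dir + "/adapter_model.bin")
generator.lora = lora                          # commenting this line out avoids the crash

prompt = "### Instruction:\nWrite a quicksort in Python.\n\n### Response:\n"
result_text = generator.generate_simple(prompt, max_new_tokens = 800)
```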

Commenting out the generator.lora line works.

Hardware is dual RTX 3090, but I'm keeping the context length down to a few tokens so that I can test on a single card. Here's the output when running on a single card with a very low context length:

Traceback (most recent call last):
  File "/home/asd/pytests/exllama/test.py", line 230, in <module>
    result_text = generator.generate_simple(prompt, max_new_tokens = 800)
  File "/home/asd/pytests/exllama/generator.py", line 316, in generate_simple
    self.gen_begin(ids, mask = mask)
  File "/home/asd/pytests/exllama/generator.py", line 186, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
  File "/home/asd/pytests/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/asd/pytests/exllama/model.py", line 1011, in _forward
    attn_mask = torch.zeros(batch_size, 1, seq_len, past_len + seq_len, dtype = torch.float16, device = devs[0])
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Also:

Traceback (most recent call last):
  File "/home/asd/pytests/exllama/test.py", line 230, in <module>
    result_text = generator.generate_simple(prompt, max_new_tokens = 800)
  File "/home/asd/pytests/exllama/generator.py", line 322, in generate_simple
    token = self.gen_single_token(mask = mask)
  File "/home/asd/pytests/exllama/generator.py", line 352, in gen_single_token
    logits = self.model.forward(self.sequence[:, -1:], self.cache, lora = self.lora, input_mask = mask)
  File "/home/asd/pytests/exllama/model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/home/asd/pytests/exllama/model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/asd/pytests/exllama/model.py", line 530, in forward
    self.self_attn.fused(hidden_states, cache, buffer, self.input_layernorm, lora)
  File "/home/asd/pytests/exllama/model.py", line 404, in fused
    attn_weights /= math.sqrt(self.config.head_dim)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
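
Since CUDA kernel errors are reported asynchronously, the lines shown in the traces above may not be the real faulting op. A quick way to get a synchronous trace (just the standard PyTorch environment variable, set before CUDA is initialized):

```python
# Force synchronous kernel launches so the traceback points at the actual faulting kernel.
# Must run before torch/CUDA initializes; the shell equivalent is:
#   CUDA_LAUNCH_BLOCKING=1 python test.py
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable on purpose
```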
krzysiekpodk commented 9 months ago

I can confirm various issues with GPTQ and LoRA. I have tested all available driver and CUDA combinations, and I have also used the exllama Docker image.

I either get the issue reported above, or this:

RuntimeError: probability tensor contains either inf, nan or element < 0

Interestingly, the above error also occurs when running with Transformers, if I run inference on a long context and forget to set the RoPE base.
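
For context, CodeLlama was trained with a RoPE base (theta) of 1e6 rather than Llama's 1e4, so long contexts degenerate into NaN logits if the base isn't carried over. A sketch of what I mean by setting it (attribute/kwarg names assumed from exllama's ExLlamaConfig and HF's LlamaConfig):

```python
from model import ExLlamaConfig

# If the quantized checkpoint's config.json doesn't include the CodeLlama rope theta,
# override it explicitly before building the model (placeholder path):
config = ExLlamaConfig("/models/CodeLlama-34B-GPTQ/config.json")
config.rotary_embedding_base = 1000000.0

# Roughly equivalent when running under Transformers (kwarg is forwarded to the config):
# model = AutoModelForCausalLM.from_pretrained(model_dir, rope_theta=1000000.0)
```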

juanps90 commented 9 months ago

This appears to be specific to CodeLLaMA-34B, as the 13B variant works with a LoRA and about 13K of context (I haven't tried more).

krzysiekpodk commented 8 months ago

I can confirm this issue doesn't reproduce with the EXL2 LoRA implementation, so I don't think it's worth troubleshooting here.
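
For anyone landing here, the working setup under exllamav2 looks roughly like this (a sketch based on exllamav2's LoRA example; paths and exact call signatures should be checked against the current repo):

```python
# Sketch: same base model + LoRA via exllamav2, which does not hit the illegal memory access.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer, ExLlamaV2Lora
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/CodeLlama-34B-exl2"    # placeholder EXL2 quant
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)                        # splits across both 3090s

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Load the LoRA adapter directory and pass it per generation call
lora = ExLlamaV2Lora.from_directory(model, "/loras/codellama-34b-alpaca")

settings = ExLlamaV2Sampler.Settings()
output = generator.generate_simple("### Instruction:\nWrite a quicksort.\n\n### Response:\n",
                                   settings, 200, loras = [lora])
```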