oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

I am getting CUDA error: an illegal memory access was encountered #3224

Closed: devops35 closed this issue 1 year ago

devops35 commented 1 year ago

Describe the bug

I am making requests through the API. After a while, all requests start returning an error. I launched the server with the following commands; the same problem occurs with all of them.

python server.py --model TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ --wbits 4 --groupsize 128 --api --listen --auto-devices --model_type llama --xformers --no_use_cuda_fp16 --loader exllama

python server.py --model TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ --wbits 4 --groupsize 128 --api --listen --auto-devices --model_type llama --xformers --loader exllama

python server.py --model TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ --wbits 4 --groupsize 128 --api --listen --model_type llama --xformers --loader exllama

Is there an existing issue for this?

Reproduction

It happens after a random amount of time, usually after 5-6 hours.
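
For reference, the requests are sent in a loop roughly like the sketch below (based on the bundled api-examples/api-example-chat.py script; the default port 5000 and the exact payload fields are assumptions, so adjust them to your setup):

import time
import requests

# Assumed default address of the blocking API (extensions/api/blocking_api.py).
URI = "http://127.0.0.1:5000/api/v1/chat"

history = {"internal": [], "visible": []}

while True:
    payload = {
        "user_input": "Hello, how are you?",  # example prompt
        "history": history,
        "max_new_tokens": 200,
    }
    response = requests.post(URI, json=payload)
    if response.status_code == 200:
        # The blocking chat API returns the updated history in results[0].
        history = response.json()["results"][0]["history"]
    else:
        print("Request failed:", response.status_code)
    time.sleep(5)  # keep hitting the API; the error shows up after hours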

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 683, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/lib/python3.10/http/server.py", line 432, in handle
    self.handle_one_request()
  File "/usr/lib/python3.10/http/server.py", line 420, in handle_one_request
    method()
  File "/workspace/text-generation-webui/extensions/api/blocking_api.py", line 95, in do_POST
    for a in generator:
  File "/workspace/text-generation-webui/modules/chat.py", line 317, in generate_chat_reply
    for history in chatbot_wrapper(text, history, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
  File "/workspace/text-generation-webui/modules/chat.py", line 234, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, eos_token=eos_token, stopping_strings=stopping_strings, is_chat=True)):
  File "/workspace/text-generation-webui/modules/text_generation.py", line 23, in generate_reply
    for result in _generate_reply(*args, **kwargs):
  File "/workspace/text-generation-webui/modules/text_generation.py", line 175, in _generate_reply
    clear_torch_cache()
  File "/workspace/text-generation-webui/modules/models.py", line 316, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 133, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
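
Note that, as the message above says, the stack trace can be misleading because CUDA reports kernel errors asynchronously; torch.cuda.empty_cache() is most likely just where an earlier illegal access surfaced, not its cause. Relaunching with synchronous kernel launches should give a more accurate trace, for example:

CUDA_LAUNCH_BLOCKING=1 python server.py --model TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ --wbits 4 --groupsize 128 --api --listen --model_type llama --xformers --loader exllama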

System Info

RTX 6000 Ada, 14 vCPU, 188 GB RAM
ghost commented 1 year ago

I have the same issue, but with OpenCL (AMD ROCm / CLBlast). It happens after some time, with any model, when chatting via the frontend (not the API). I never had this issue before.

devops35 commented 1 year ago

@oobabooga how can I solve this problem?

goodglitch commented 1 year ago

I got the same error when using the text-generation-webui API with the ExLlama loader from TavernAI. After restarting text-generation-webui, I got a blue screen with a video error during the first answer.

mageOfstructs commented 1 year ago

I think it happens when the chat history gets too big and CUDA then, for some reason, accesses an illegal memory location. It happened to me as well. First I got errors about my sequence length being too small, and after increasing it a few times, this error appeared. Resetting the history seems to work, but it's obviously not a permanent solution.

EDIT: Just my theory, but I think what happens is that the GPU runs out of VRAM. I tried different settings, and the configurations that use less VRAM lived longer.
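
A quick way to test the VRAM theory is to log free device memory before and after each generation; torch.cuda.mem_get_info() returns the free and total memory of a device in bytes. A minimal sketch (run inside the same Python environment the web UI uses):

import torch

def log_vram(tag, device=0):
    # mem_get_info() returns (free_bytes, total_bytes) for the given device.
    free, total = torch.cuda.mem_get_info(device)
    used_gib = (total - free) / 1024**3
    print(f"[{tag}] VRAM used: {used_gib:.2f} / {total / 1024**3:.2f} GiB")

log_vram("before generation")
# ... run a generation ...
log_vram("after generation")

If the used figure keeps creeping up as the chat history grows and the crash happens near the card's limit, that would support the out-of-VRAM explanation.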

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.