oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Segmentation fault with llama.cpp when making two API requests in quick succession #5939

Open · p-e-w opened 5 months ago

p-e-w commented 5 months ago

Describe the bug

With the llama.cpp loader, cancelling a running API request and quickly dispatching a second one crashes the whole application with a segmentation fault. This appears to happen with any GGUF model (confirmed with Mixtral, Yi-34B, Command R) when evaluation is split between CPU and GPU.

This has been happening for months, but only now have I managed to pinpoint a reliable reproduction. This might be the same thing described in #5630.

Is there an existing issue for this?

Reproduction

  1. Start TGWUI with --api.
  2. Load a GGUF model with llama.cpp.
  3. Make a request to the text completion API.
  4. Cancel the request while it is running.
  5. Immediately (within a second or less) make another request identical to the first one (both steps are scripted below).
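
A minimal sketch of steps 3-5, assuming the OpenAI-compatible completion endpoint on the default port 5000 (the prompt, token count, and timeout are arbitrary):

```python
import requests

# Assumes TGWUI was started with --api and is serving the
# OpenAI-compatible API on the default port 5000.
API_URL = "http://127.0.0.1:5000/v1/completions"

payload = {
    "prompt": "Write a long story about a dragon.",
    "max_tokens": 512,
    "stream": True,
}

# Steps 3-4: start a streaming completion, read a single chunk,
# then close the connection mid-generation, i.e. cancel it.
with requests.post(API_URL, json=payload, stream=True, timeout=60) as response:
    next(response.iter_content(chunk_size=None))

# Step 5: immediately send an identical request. With the
# llama.cpp loader, the server segfaults around this point.
requests.post(API_URL, json=payload, timeout=60)
```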

Screenshot

No response

Logs

Segmentation fault (core dumped)

Nothing else in the logs.
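
Since the crash leaves nothing in the logs, a native backtrace would have to be captured externally, e.g. by running the server under gdb (the model path is a placeholder; adjust flags for your setup):

```
gdb -batch -ex run -ex bt --args python server.py --api --model <your-model>.gguf
```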

System Info

RTX 3060 12 GB on Ubuntu 22.04.

Xyem commented 3 months ago

I keep hitting this issue because I have SillyTavern running alongside some ComfyUI nodes, both of which use the same Oobabooga backend. Cancelling the first request isn't necessary to trigger the segmentation fault; it also happens when the two requests arrive at roughly the same time (see the sketch below).
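
For reference, a minimal sketch of this concurrent case, again assuming the default API port 5000 (the two threads stand in for SillyTavern and a ComfyUI node firing at once):

```python
import threading
import requests

API_URL = "http://127.0.0.1:5000/v1/completions"  # assumed default port
payload = {"prompt": "Hello, world.", "max_tokens": 256}

# Two clients hitting the backend at roughly the same time;
# no cancellation involved.
threads = [
    threading.Thread(target=requests.post, args=(API_URL,),
                     kwargs={"json": payload, "timeout": 120})
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```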

Since I don't update my instance unless I have a specific reason to, I've only started hitting this issue recently.