Additionally, looking into the llama_cpp project suggests that there are quite a few reports of crashes on different hardware with similar behaviour on various models, e.g. https://github.com/abetlen/llama-cpp-python/issues/1326 or https://github.com/abetlen/llama-cpp-python/issues/1319.
My guess is that 0.2.59 might need some more time to bake, and perhaps reverting to 0.2.56 is generally a good idea.
Also on Windows 10 with an Nvidia RTX 3050 Ti Laptop GPU (modified to 8 GB VRAM): with a Yi or Qwen model loaded, it crashed without any error message, showing only "Press any key to continue...", while evaluating the prompt. Downgrading llama-cpp-python to 0.2.56 fixed it.
More untested breakages, yay. How do we "Downgrad[e] llama-cpp-python and llama-cpp-python-cuda to 0.2.56"?
I modified my requirements.txt, deleted the install folder, and installed again. I bring my own venv, so I just needed to re-run pip install -r requirements.txt.
Edit: make sure to run the webui directly with python server.py afterwards.
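For anyone wondering what that looks like concretely, here is a minimal sketch of the manual downgrade, assuming the 0.2.59 version string appears in the wheel pins in your requirements file (the exact file varies by GPU/OS, so check it first):

```sh
# From the text-generation-webui directory, inside your own venv/conda env.
# Assumes requirements.txt pins llama_cpp_python / llama_cpp_python_cuda wheels
# by version string; on Windows, edit the file by hand instead of using sed.
sed -i 's/0\.2\.59/0.2.56/g' requirements.txt
pip install -r requirements.txt
python server.py   # then launch the webui directly, as noted above
```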
This is tricky to describe - the manual commands vary depending on GPU, OS, etc. Further, one_click.py rolls all the update functionality into one step, so you can't, say, roll back a git commit and then update the requirements (note: this functionality would be really useful for testing/debugging!). I think the easiest way to do it would be to use git to roll back to commit 308452b, activate the Conda environment, and run python -c "import one_click; one_click.update_requirements(pull=False)", which should update based on the old requirements file.
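In command form, that rollback would look roughly like the following (a sketch, assuming the default one-click install layout where the Conda env lives under installer_files/env; adjust paths to your setup):

```sh
cd text-generation-webui
git checkout 308452b                    # roll back to the pre-0.2.59 requirements
conda activate ./installer_files/env    # or enter the env via cmd_linux.sh / cmd_windows.bat
python -c "import one_click; one_click.update_requirements(pull=False)"
```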
@oobabooga has just updated llama-cpp-python's version to 0.2.60. Has anyone tried whether it fixes this issue?
Just tested, the answer is no.
Still crashes for me instantly when entering a prompt.
Running all layers on an Nvidia GPU, tensorcores on, Win 11.
Edit: same result if loading on CPU only.
Thanks for testing - I've added the extra info to the bug report.
My guess is that there's some kind of memory safety issue in the CPU component of llama_cpp_python which is causing the problem, probably https://github.com/abetlen/llama-cpp-python/issues/1326. If I can, I'll see if I can bisect that code to narrow it down.
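(If anyone wants to help with that bisect, a generic sketch follows; the tag names and build commands are assumptions, and a source build needs cmake plus whatever CMAKE_ARGS your GPU backend requires.)

```sh
git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
git bisect start v0.2.59 v0.2.56         # known-bad first, then known-good (tag names assumed)
# At each step: sync the vendored llama.cpp, rebuild, and try to reproduce the crash.
git submodule update --init --recursive
pip install -e . --force-reinstall --no-cache-dir
# ...run the crashing prompt against this build, then mark the commit...
git bisect good                           # or `git bisect bad`; repeat until it converges
```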
@oobabooga I would strongly recommend downgrading the requirements for llama_cpp_python/llama_cpp_python_cuda to 0.2.56, as right now the llama.cpp backend is broken for new installs/updates.
Same problem for me.
The issue dgdguk linked above points to a temporary solution: in the 'Model' tab, tick logits_all when you load a model with llamacpp or llamacpp_HF. Despite the warning about making prompt evaluation slower, for now it fixes the crash and (in my case, ymmv) doesn't seem slower than usual.
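If you launch from the command line rather than the UI, the same workaround can probably be applied at startup; a sketch, assuming the logits_all option is also exposed as a CLI flag in your version (verify with python server.py --help):

```sh
# Flag and loader names assumed from the UI options; check `python server.py --help` first.
python server.py --loader llamacpp_HF --model <your-model> --logits_all
```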
As of #5823, llama_cpp_python/llama_cpp_python_cuda have been downgraded to 0.2.56, which appears to have fixed the issue. If you're still having problems, upgrade to the latest version and it should take care of the requirements (at worst, delete the Conda environment and reinstall).
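For reference, the recovery path on a default one-click install looks roughly like this (script names differ per OS and webui version, so treat it as a sketch):

```sh
# Normal path: pull the fixed requirements and reinstall them.
./update_linux.sh            # or update_windows.bat / the update wizard, depending on version
# Worst case: wipe the bundled Conda env and let the start script rebuild it.
rm -rf installer_files/env
./start_linux.sh
```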
I do think that this issue highlights that the project needs a sane way of reverting to an older version, though. I know that development moves quickly, but from any software development point of view, making the only version available through the updater the current bleeding edge is somewhat crazy.
In any case, as this is now fixed, I'm closing the issue.
Agreed. It should be relatively simple to adapt the update_xx.(sh/bat) and start_xx.(sh/bat) scripts to check the requirements on each start (akin to most stable-diffusion webuis), instead of letting the updater scripts fetch the branch's HEAD before updating requirements. It would fix two issues at once: make it easy to revert to / check out a specific commit, and handle requirements changes instead of booting into an unknown env that some change breaks.
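As a rough illustration of that idea, the start script could re-run pip only when the requirements file has actually changed; everything here (marker file name, layout) is hypothetical:

```sh
# Hypothetical per-start check: reinstall requirements only when requirements.txt changed.
REQ_STAMP=installer_files/.requirements.sha256
if ! sha256sum -c "$REQ_STAMP" >/dev/null 2>&1; then
    pip install -r requirements.txt
    sha256sum requirements.txt > "$REQ_STAMP"
fi
python server.py "$@"
```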
Describe the bug
Since the update to llama_cpp_python/llama_cpp_python_cuda 0.2.59, there's a segfault on at least an AMD RX 7900 XT when using models loaded with the llama.cpp loader. This seems to occur once the prompt is sufficiently long, with small models getting maybe a hundred tokens' worth of dialogue and larger models falling over almost immediately. Downgrading llama-cpp-python and llama-cpp-python-cuda to 0.2.56 fixes the issue, so a quick fix for anyone affected is to do that.
According to subsequent reports by @Touch-Night and @jepjoo, 0.2.60 does not fix this issue. Further, while the original report only listed an AMD GPU as the problem, it seems that both the Nvidia GPU and CPU platforms are also impacted. The precise problem seems to be underlying issues in llama_cpp_python which are causing a crash on certain LLM architectures - Mistral models crash almost instantly (context length >= 2), while Phi models take tens of tokens before crashing.
Reproduction
System Info
Subsequent reports also illustrate crashes on Nvidia GPUs and on CPU-only setups.