MikeRoz47 closed this issue 8 months ago.
Yep, that sounds like the same thing. Shouldn't an issue remain open for visibility until the underlying bug is resolved? I searched open issues; it didn't occur to me to search closed ones, since the bug still exists in main.
Yeah, it feels like they closed it preemptively; it's "fixed" but the fix isn't here yet.
Rolling back to v0.2.31 solves the problem.
How does an end user do this? I tried cloning an old snapshot, but the one-click installer just changes requirements.txt and installs 0.2.38.
1. Find the version of llama-cpp-python in requirements.txt that's applicable to you (for your GPU and version of Python). For me, it was https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.38+cu121-cp311-cp311-win_amd64.whl. If you're using the default Miniconda environment, you're probably on Python 3.10, so you want URLs with 'cp310' in them.
2. Use the cmd_windows/linux/macos/wsl script that's appropriate for your setup to launch a command prompt with the appropriate conda environment activated.
3. Run the following command to install the old version of llama-cpp-python: pip install https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.31+cu121-cp311-cp311-win_amd64.whl. Be sure to swap in the base URL appropriate for your setup, then change the version number from 0.2.38 to 0.2.31.
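Putting those steps together, here is a rough command-line sketch for the Windows one-click install; run the commands one at a time (the pip line goes in the prompt that cmd_windows.bat opens), and swap in the wheel URL you found in step 1 if yours differs from the CUDA 12.1 / Python 3.11 / tensorcores build shown here:

```
rem Launch a prompt with the webui's conda environment activated
cmd_windows.bat

rem In the prompt it opens, downgrade to the 0.2.31 wheel that matches this setup
pip install https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.31+cu121-cp311-cp311-win_amd64.whl
```

Restart the webui afterwards so the downgraded backend actually gets loaded.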
It's very kind of you to give such a meticulous explanation @MikeRoz47. Thank you very much! 👍
Thank you as well, MikeRoz47. I am going to give this a try. It seems complicated to me since I've never manually rolled back before, but I think I follow it. lol
However, what I came here to ask about is a bug I'm having with Mixtral 8x7B Instruct GGUF 5_M. Some prompts look normal, but often when I ask it to write a story, it replies very poorly...
It will produce something like "Mary was very pleased, happy, well, fine, this morning." Later on it will start repeating the same thing, or almost repeating it. It only started happening when I updated just the other day. Is it possible or likely that this is the bug described here? Other models of mine are doing it too. Thanks.
Quote: MikeRoz47 FYI, 0.2.40 seems to be in the wheels repo now. I am unable to reproduce the deterministic behavior seen with 0.2.38 so far. If you're manually working around this issue, you can upgrade to the new version (same instructions as above, just with 0.2.40 rather than 0.2.31 as your target version). Hopefully there will be an update to requirements.txt shortly, and this bug can be closed as fixed.
Great news. I'm going to make a brand-new install of the whole program before I try this, so I can keep the current version untouched. I think I did the rollback and it's working now; I don't want to mess it up. lol
v0.2.42 is out. Just waiting for oobabooga to run his GitHub Actions on his Windows wheels repo (https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/) so I can test.
Should be fixed now.
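For anyone landing here after the fix: once requirements.txt pins a fixed llama-cpp-python, the normal update path should pull it in. A rough sketch, assuming the Windows one-click install and its bundled update script (use the matching script on Linux/macOS/WSL):

```
rem Update the webui and reinstall the pinned requirements
update_windows.bat

rem Then, from a prompt opened with cmd_windows.bat, confirm the backend is no longer 0.2.38/0.2.39
pip list | findstr llama
```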
Describe the bug
The sampling parameters provided on the Parameters/Generation tab seem to be partially or completely ignored since the move to v0.2.38 of llama-cpp-python. I first noticed this issue using the API, but can reproduce it from within the webui. Rolling back to v0.2.31 solves the problem. I've also tested and confirmed that the issue is present when using v0.2.39 as well.
Is there an existing issue for this?
Reproduction
Model is LLaMA2-13B-Tiefighter.Q8_0.gguf. I am running it entirely on my GPU with 4096 context. I've checked 'tensorcores'. All other model settings are default. I have seen this issue with other, larger models like goliath-120b and Mistral derivatives.
I set the sampling parameters to the following:
max_new_tokens: 512
temperature: 1
top_p: 1
min_p: 0.1
top_k: 0
repetition_penalty: 1.05
presence_penalty: 0
frequency_penalty: 0
typical_p: 1
tfs: 1
mirostat_mode: 0
mirostat_tau: 5
mirostat_eta: 0.1
seed: -1
On the notebook tab, I paste the following prompt:
I then use the 'generate' and 'regenerate' buttons to generate responses.
When using v0.2.31, I get distinctly different responses each time:
When using v0.2.38, I get one distinct response the first time, and then all subsequent responses are the same as the second one below. Perhaps the first one being different is related to issue #5434?
I discovered by accident that turning temperature up to 5 and min_p down to 0 gives me gibberish from v0.2.31:
But when I use those settings on v0.2.38, I get the same response as before. It's almost as though the parameters are being completely ignored, or there is some new setting that's overriding what's being provided by the parameters tab.
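For reference, here is a minimal sketch of an equivalent request over the webui's OpenAI-compatible API, assuming the API is enabled on its default port 5000 and the request is sent from a bash-style shell (adjust quoting for cmd/PowerShell); the sampler fields mirror the names above and the prompt is a placeholder:

```
curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<prompt pasted on the notebook tab>",
    "max_tokens": 512,
    "temperature": 1,
    "top_p": 1,
    "min_p": 0.1,
    "top_k": 0,
    "repetition_penalty": 1.05,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "typical_p": 1,
    "tfs": 1,
    "mirostat_mode": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "seed": -1
  }'
```

With seed -1, sending the same request twice should give noticeably different completions on 0.2.31; on 0.2.38/0.2.39, every response after the first comes back identical.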
Screenshot
No response
Logs
System Info
Windows 11, NVIDIA RTX 4090, Python 3.11.7. I was using the default Miniconda environment when I first noticed this issue, but switched to my own venv while trying to isolate it.
I both upgraded my original installation and created a clean installation last night, so both are as of commit 0f134bf.
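For anyone comparing setups, a quick way to capture the same details, run from a prompt opened with the cmd_windows/linux/macos/wsl script inside the webui folder:

```
rem Webui commit the installation is on
git rev-parse --short HEAD

rem Python version in the active conda environment
python --version

rem Installed llama-cpp-python wheel and its version
pip list | findstr llama
```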