Closed: viktor-ferenczi closed this issue 1 year ago.
Tested the same GGUF file with llama.cpp in text-generation-webui (gpu_layers=44) and it worked perfectly with the exact same prompt template and prompt. It gave very good code output, so the model itself works well. For comparison:
import os
import hashlib
from collections import defaultdict


def find_duplicate_files(folder):
    """Finds duplicate files in the given folder.

    Args:
        folder (str): The path to the folder containing the files.

    Returns:
        dict: A dictionary with the file size and checksum as keys, and a list of
            paths to each file as values. Only includes duplicates where there are
            at least two files in the list.
    """
    # Create a dictionary to store the results
    result = defaultdict(list)
    # Walk through all files in the folder and its subdirectories
    for root, dirs, files in os.walk(folder):
        for file in files:
            # Calculate the SHA256 hash of the file
            with open(os.path.join(root, file), 'rb') as f:
                checksum = hashlib.sha256()
                while True:
                    data = f.read(32 * 1024)
                    if not data:
                        break
                    checksum.update(data)
            # Get the file size and calculate the tuple key for the dictionary
            filesize = os.path.getsize(os.path.join(root, file))
            key = (filesize, checksum.hexdigest())
            # Add the path to the list of paths for this key in the result dictionary
            result[key].append(os.path.join(root, file))
    return {k: v for k, v in result.items() if len(v) > 1}
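For reference, a minimal way to exercise the generated function (the folder path below is just a placeholder):

duplicates = find_duplicate_files("/path/to/folder")  # placeholder path
for (size, digest), paths in duplicates.items():
    print(f"{len(paths)} copies, {size} bytes, sha256={digest}")
    for path in paths:
        print(f"  {path}")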
Since there are a lot of tokens in the prompt and response, I'm guessing it ran out of context. Can you please try increasing context_length (the default for llama is 512):
llm = AutoModelForCausalLM.from_pretrained(..., context_length=2048)
I just want to add that there is something weird going on with llama models in ctransformers vs. llama.cpp (as part of text-generation-web-ui). With the exact same prompt and parameters like top_k and temperature, ctransformers sometimes produces no output, or every line of "chatting" starts with one or more emoticons, while llama.cpp outputs very nice content. I'm really confused about why this happens. It's definitely not about context length (4096 in my case).
Emojis could be due to https://www.reddit.com/r/LocalLLaMA/comments/167ytg3/if_all_your_text_completions_start_with_an_emoji/ if your prompt ends with a trailing space.
Check if you have the latest version of ctransformers installed (pip show ctransformers should show at least 0.2.25). Please provide a link to the model you are using and the exact settings (prompt, temperature, etc.). Can you also check whether it happens outside the webui, in a separate Python script/notebook:
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("/path/to/model.[bin/gguf]", model_type="llama", context_length=4096)
print(llm(prompt, top_k=..., temperature=...))
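If the trailing-space issue linked above is the culprit, stripping the prompt before the call is a simple guard (a minimal sketch, independent of the model used):

# Remove trailing whitespace so the model does not continue from a dangling space.
print(llm(prompt.rstrip(), top_k=..., temperature=...))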
Emojis could be due to https://www.reddit.com/r/LocalLLaMA/comments/167ytg3/if_all_your_text_completions_start_with_an_emoji/ if your prompt ends with a trailing space.
Wow, that really did the trick! Looks like llama-cpp-python may be stripping any trailing spaces, because it never happened there. Thanks so much for that hint, I would never have figured it out on my own.
I still have the problem that, when chatting in turns with ctransformers, I randomly end up at points where the model outputs nothing at all with standard settings. Model: llama2_7b_chat_uncensored.ggmlv3.q5_K_M.bin from https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML
This never happened with the same model inside text-generation-web-ui. Maybe it's about the stop token \n: sometimes the model wants to start its output with \n and it immediately stops? The strange thing is that it takes a while to calculate the output and then still outputs nothing; it's not like it stops immediately. Is there a way to get verbose output of what's going on?
This happens both with normal prompts and [INST] prompts.
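One way to see more of what is going on is to stream the tokens as they are generated, which ctransformers supports; if the model immediately produces the stop sequence, the loop ends without printing anything. The model path, stop sequence, and sampling values below are assumptions for illustration:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/llama2_7b_chat_uncensored.ggmlv3.q5_K_M.bin",
    model_type="llama",
    context_length=4096,
)

prompt = "..."  # the chat prompt being tested
# Stream tokens so empty or immediately-stopped generations become visible.
for token in llm(prompt, stream=True, stop=["\n"], temperature=0.7):
    print(repr(token))  # repr() makes whitespace and newlines explicit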
llm = AutoModelForCausalLM.from_pretrained(..., context_length=2048)
Thank you Marella, it worked perfectly! Actually my bad, I need to get better at diagnosing LLM issues.
OS: Ubuntu 22.04
CUDA: 11.8
GPU: 1x 4090 (24GB)
Model: https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GGUF
Precision: codellama-34b-instruct.Q5_K_M.gguf
Installed as:
pip install ctransformers[cuda]
Code:
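A minimal sketch of loading and running this model with ctransformers on the GPU follows; the gpu_layers value, context_length, prompt, and sampling parameters are illustrative assumptions, not the exact values from the original run:

from ctransformers import AutoModelForCausalLM

# Assumed settings for illustration; the Q5_K_M quant of CodeLlama-34B-Instruct was used.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/CodeLlama-34B-Instruct-GGUF",
    model_file="codellama-34b-instruct.Q5_K_M.gguf",
    model_type="llama",
    gpu_layers=50,        # assumption: number of layers offloaded to the 24GB 4090
    context_length=4096,  # assumption
)

prompt = "[INST] Write a Python function that finds duplicate files in a folder. [/INST]"  # illustrative
print(llm(prompt, max_new_tokens=512, temperature=0.7))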
It produces output with massive repetition, starting at the same point every time:
Here is a possible solution:
Sometimes it repeated a few words; it also happened with the underscore character repeated 1000 times.
I tried loading fewer layers on the GPU to free up memory, but it did not change anything, it only made it run slower.
I also tried tuning repetition_penalty between 1.0 and 10.0, but it did not fix the problem, it only cut the output short.
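For reference, repetition_penalty is passed alongside the other sampling parameters in the call; the values below are illustrative only:

# Mild penalties (just above 1.0) are typical; large values tend to truncate the output.
print(llm(prompt, temperature=0.7, top_k=40, repetition_penalty=1.15))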