Closed: viktor-ferenczi closed this issue 1 year ago.
Tested the same GGUF file with llama.cpp in text-generation-webui (gpu_layers=44) and it worked perfectly with the exact same prompt template and prompt. It gave very good code output, so the model itself works well. For comparison:
import os
import hashlib
from collections import defaultdict


def find_duplicate_files(folder):
    """Finds duplicate files in the given folder.

    Args:
        folder (str): The path to the folder containing the files.

    Returns:
        dict: A dictionary with the file size and checksum as keys, and a list of
            paths to each file as values. Only includes duplicates where there are
            at least two files in the list.
    """
    # Create a dictionary to store the results
    result = defaultdict(list)
    # Walk through all files in the folder and its subdirectories
    for root, dirs, files in os.walk(folder):
        for file in files:
            # Calculate the SHA256 hash of the file
            with open(os.path.join(root, file), 'rb') as f:
                checksum = hashlib.sha256()
                while True:
                    data = f.read(32 * 1024)
                    if not data:
                        break
                    checksum.update(data)
            # Get the file size and calculate the tuple key for the dictionary
            filesize = os.path.getsize(os.path.join(root, file))
            key = (filesize, checksum.hexdigest())
            # Add the path to the list of paths for this key in the result dictionary
            result[key].append(os.path.join(root, file))
    return {k: v for k, v in result.items() if len(v) > 1}
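For reference, a minimal way to exercise the generated function (the folder path below is just a placeholder):

duplicates = find_duplicate_files("/path/to/folder")  # placeholder path
for (size, digest), paths in duplicates.items():
    print(f"{len(paths)} copies, {size} bytes, sha256={digest}")
    for path in paths:
        print(f"  {path}")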
Since there are a lot of tokens in the prompt and response, I'm guessing it ran out of context. Can you please try increasing context_length (the default for llama is 512):
llm = AutoModelForCausalLM.from_pretrained(..., context_length=2048)
I just want to add that there is something weird going on with llama models in ctransformers vs. llama.cpp (as part of text-generation-web-ui). With the exact same prompt and parameters like top_k and temperature, ctransformers sometimes produces no output, or every line of "chatting" starts with one or more emoticons, while llama.cpp outputs very nice content. I'm really confused about why this happens. It's definitely not about context length (4096 in my case).
Emojis could be due to https://www.reddit.com/r/LocalLLaMA/comments/167ytg3/if_all_your_text_completions_start_with_an_emoji/ if your prompt ends with a trailing space.
Check if you have the latest version of ctransformers installed (pip show ctransformers should show at least 0.2.25). Please provide a link to the model you are using and the exact settings (prompt, temperature, etc.). Can you also check whether it happens outside the webui, in a separate Python script/notebook:
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained("/path/to/model.[bin/gguf]", model_type="llama", context_length=4096)
print(llm(prompt, top_k=..., temperature=...))
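If the trailing-space issue linked above is the culprit, stripping the prompt before the call is a simple guard (a minimal sketch, independent of the model used):

# Remove trailing whitespace so the model does not continue from a dangling space.
print(llm(prompt.rstrip(), top_k=..., temperature=...))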
Emojis could be due to https://www.reddit.com/r/LocalLLaMA/comments/167ytg3/if_all_your_text_completions_start_with_an_emoji/ if your prompt ends with a trailing space.
Wow, that really did the trick! Looks like llama-cpp-python may be stripping any trailing spaces, because it never happened there. Thanks so much for that hint, I would never have figured it out on my own.
I still have the problem that, when chatting in turns with ctransformers, I randomly end up at points where the model outputs nothing at all with standard settings. Model: llama2_7b_chat_uncensored.ggmlv3.q5_K_M.bin from https://huggingface.co/TheBloke/llama2_7b_chat_uncensored-GGML
This never happened with the same model inside text-generation-web-ui. Maybe it's about the stop token \n: sometimes the model wants to start its output with \n and it immediately stops? The strange thing is that it takes a while to calculate the output and then still outputs nothing; it's not like it stops immediately. Is there a way to get verbose output of what's going on?
This happens both with normal prompts and [INST] prompts.
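One way to see more of what is going on is to stream the tokens as they are generated, which ctransformers supports; if the model immediately produces the stop sequence, the loop ends without printing anything. The model path, stop sequence, and sampling values below are assumptions for illustration:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/llama2_7b_chat_uncensored.ggmlv3.q5_K_M.bin",
    model_type="llama",
    context_length=4096,
)

prompt = "..."  # the chat prompt being tested
# Stream tokens so empty or immediately-stopped generations become visible.
for token in llm(prompt, stream=True, stop=["\n"], temperature=0.7):
    print(repr(token))  # repr() makes whitespace and newlines explicit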
llm = AutoModelForCausalLM.from_pretrained(..., context_length=2048)
Thank you Marella, it worked perfectly! Actually my bad, I need to get better at diagnosing LLM issues.
OS: Ubuntu 22.04
CUDA: 11.8
GPU: 1x 4090 (24GB)
Model: https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GGUF
Precision: codellama-34b-instruct.Q5_K_M.gguf
Installed as:
pip install ctransformers[cuda]
Code:
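A minimal sketch of loading and running this model with ctransformers on the GPU follows; the gpu_layers value, context_length, prompt, and sampling parameters are illustrative assumptions, not the exact values from the original run:

from ctransformers import AutoModelForCausalLM

# Assumed settings for illustration; the Q5_K_M quant of CodeLlama-34B-Instruct was used.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/CodeLlama-34B-Instruct-GGUF",
    model_file="codellama-34b-instruct.Q5_K_M.gguf",
    model_type="llama",
    gpu_layers=50,        # assumption: number of layers offloaded to the 24GB 4090
    context_length=4096,  # assumption
)

prompt = "[INST] Write a Python function that finds duplicate files in a folder. [/INST]"  # illustrative
print(llm(prompt, max_new_tokens=512, temperature=0.7))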
It produces output with massive repetition, starting at the same point every time:
Here is a possible solution:
Sometimes it repeated a few words; it also happened with the underscore character repeated 1000 times.
I tried loading fewer layers on the GPU to free up memory, but it did not change anything, it only made it run slower.
I also tried tuning repetition_penalty between 1.0 and 10.0, but it did not fix the problem, it only cut the output short.
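For reference, repetition_penalty is passed alongside the other sampling parameters in the call; the values below are illustrative only:

# Mild penalties (just above 1.0) are typical; large values tend to truncate the output.
print(llm(prompt, temperature=0.7, top_k=40, repetition_penalty=1.15))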