unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Batch inference produces nonsense results for unsloth/mistral-7b-instruct-v0.2-bnb-4bit #267

Open ziemowit-s opened 5 months ago

ziemowit-s commented 5 months ago

Hi there,

after loading the model with:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

tokenizer.padding_side='left'
tokenizer.pad_token = tokenizer.eos_token

I performed a batch inference:

instructs = []

for r in texts:
    summary_inst = f"""Provide a very short summary of the text: {r}"""

    chat = [
        {"role": "user", "content": summary_inst},
    ]

    txt = tokenizer.apply_chat_template(chat, tokenize=False)
    instructs.append(txt)

inputs = tokenizer(instructs, return_tensors = "pt", padding=True).to("cuda")
response = model.generate(**inputs, max_new_tokens = 512, do_sample=False).cpu().numpy()

raw_txts = tokenizer.batch_decode(response, skip_special_tokens=True)
response = [rr.split("[/INST]")[-1].replace("</s>", "") for rr in raw_txts]

The received answers are nonsensical. The batch consists of 3 elements, and the second one - which is the longest - is the only correct one; the other two are nonsense. When I shorten all the texts (to a maximum of 3000 characters), all the answers return to normal. It also works fine when I run inference on each text individually.
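For reference, a minimal sketch of the one-at-a-time loop that behaves correctly (reusing `model`, `tokenizer`, and `instructs` from above):

```python
# Sketch of the per-prompt loop that works: no padding is involved when each
# prompt is tokenized on its own.
single_responses = []
for txt in instructs:
    single = tokenizer(txt, return_tensors="pt").to("cuda")
    out = model.generate(**single, max_new_tokens=512, do_sample=False)
    decoded = tokenizer.batch_decode(out, skip_special_tokens=True)[0]
    single_responses.append(decoded.split("[/INST]")[-1].strip())
```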

Attachments: texts.txt, nonsense_texts.txt

The texts used to generate the summaries are attached as texts.txt, and the nonsense answers are in nonsense_texts.txt (the 3 entries are separated by an <END> tag) so the issue can be reproduced. Below is an example of a nonsense answer:

' The , 1200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000',
danielhanchen commented 5 months ago

@ziemowit-s I'll check this out! Sorry on the issue!

ziemowit-s commented 5 months ago

Don't worry, it's a relatively new library so bugs are expected :)

its5Q commented 5 months ago

Hey, just want to confirm, I have the exact same issue with my Llama model. Inference on single samples works fine, but it produces garbage on batches of multiple samples. I'm loading my model in bfloat16 without quantization.

danielhanchen commented 5 months ago

@ziemowit-s @its5Q Apologies on the issues again :( Still debugging stuff so sorry on that!

danielhanchen commented 5 months ago

Actually can confirm - batched inference in fact is breaking - I'm working on a fix asap - sorry for the wait guys!

danielhanchen commented 5 months ago

@ziemowit-s @its5Q Much apologies on the delay - I temporarily fixed it by disabling Unsloth's fast inference paths - it seems like I need to dig deeper on why this is happening :( Using pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git will get the temporary fix.

Again sorry for the inconvenience!

danielhanchen commented 5 months ago

@ziemowit-s @its5Q I think I finally fixed it!! On the example @ziemowit-s provided me:

[' The text emphasizes the benefits of humor in the healing process, including reducing stress, improving mood, and boosting the immune system. It suggests strategies such as seeking out humor that resonates, finding humor in everyday situations, sharing a laugh with others, using humor as a coping mechanism, and being gentle with oneself. The text also encourages taking things one step at a time and seeking support when needed.',
 ' The text discusses various causes for memory and concentration issues beyond anxiety, including nutritional deficiencies, sleep deprivation, chronic stress, medications, medical conditions, substance abuse, brain injuries, and aging. Daniella shares her experiences of anxiety, stress, skipping meals, and lack of sleep. Irvin suggests prioritizing self-care, relaxation techniques, and speaking with a therapist to manage stress and memory issues. Daniella expresses concerns about the cost and time commitment of adding new treatments to her therapy sessions. Irvin emphasizes the importance of investing in mental health and encourages Daniella to consider speaking with her therapist about her concerns.',
 ' Dissociative disorders involve alterations in consciousness, memory, identity, or perception, and can include feelings of worthlessness and isolation due to detachment from self and others. These symptoms should be discussed with a therapist for proper diagnosis and treatment. While feelings of worthlessness and isolation are common, they may indicate an underlying mental health condition. Reach out for help and support if these feelings persist and interfere with daily life.',
 ' Board games can help manage anxiety by providing distraction, social interaction, problem-solving, relaxation, and fun. The benefits may vary for individuals, so experimenting with different types of games is recommended.']

Single inference is again faster - batched is similar speed for now. Use pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git to update on local machines (no need on Colab / Kaggle).

its5Q commented 5 months ago

Awesome, I'll test it as soon as I get to it

its5Q commented 5 months ago

Tried it myself and I'm getting the same weird output as before. One thing I've noticed is that the weird output only comes from the samples that are padded, while the longest prompt in the batch produces normal output. If all the samples in the batch are the same length in tokens, so no padding is required, the model output for all samples is as expected. Using unsloth from commit d3a33a0dc3cabd3b3c0dba0255fb4919db44e3b5
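A small sketch of how one can check which prompts in a batch actually get padded (assuming a `tokenizer` and a hypothetical list of prompt strings `prompts`): only prompts shorter than the longest one receive padding.

```python
# Compare tokenized lengths: anything shorter than the longest prompt
# will be padded when the batch is built with padding=True.
lengths = [len(tokenizer(p).input_ids) for p in prompts]
longest = max(lengths)
for p, n in zip(prompts, lengths):
    note = "no padding" if n == longest else f"padded by {longest - n} tokens"
    print(f"{n:5d} tokens ({note}): {p[:60]!r}")
```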

danielhanchen commented 5 months ago

@its5Q That's very weird :( For me it seems to work perfectly. Here's an example if you can run it:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

inputs = [
    "Create a Python program using Pytorch to create a simple neural network for image classification.\n"\
    "You need to do the data preparation step, the training step, and the inference step as well.",

    "Create a Python program to compute all the primes.",

    "Write a long essay about happiness, and how to attain it. Provide clear markdown sections.",

    "20*20=?",
]

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "left"
inputs = tokenizer(inputs, return_tensors = "pt", padding = True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = False, use_cache = True)

decoded = tokenizer.batch_decode(outputs)
for text in decoded:
    print(text.replace(tokenizer.pad_token, ""))
    print("_" * 70)

You will get:

<s> Create a Python program using Pytorch to create a simple neural network for image classification.
You need to do the data preparation step, the training step, and the inference step as well.

Here's a simple example of a neural network for image classification using PyTorch. This example uses the MNIST dataset, which consists of 60,000 28x28 grayscale images of digits 0-9.

First, let's install the required packages:

```bash
pip install torch torchvision

Now, let's write the code:

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim

# Load and normalize the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False)

# Define the neural network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(),
______________________________________________________________________
<s> Create a Python program to compute all the primes.

Here's a simple Python program to find all prime numbers up to a given limit:

```python
def is_prime(n):
    """
    Check if a number is prime.
    """
    if n <= 1:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def find_primes(limit):
    """
    Find all prime numbers up to a given limit.
    """
    primes = []
    for n in range(2, limit + 1):
        if is_prime(n):
            primes.append(n)
    return primes

if __name__ == "__main__":
    limit = int(input("Enter the limit: "))
    primes = find_primes(limit)
    print(f"Prime numbers up to {limit}:")
    print(primes)

This program uses two functions: is_prime() to check if a number is prime, and find_primes() to find all prime numbers up to a given limit. The main part of the code is in the if __name__ == "__main__": block, where it takes user input for the limit and then prints out the prime numbers found.


Write a long essay about happiness, and how to attain it. Provide clear markdown sections.

Happiness: The Elusive Pursuit

Happiness is a concept that has puzzled philosophers, theologians, and ordinary people for centuries. It is a state of well-being and contentment, a feeling of joy and satisfaction with life. Yet, despite its importance, happiness remains an elusive and subjective experience. In this essay, we will explore the nature of happiness, its sources, and the ways to attain it.

The Nature of Happiness

Happiness is a complex and multifaceted experience. It is not a static state, but rather a dynamic process that ebbs and flows throughout our lives. Happiness is not the absence of suffering or hardship, but rather the ability to find meaning and joy in the midst of challenges. It is a state of mind that is shaped by our thoughts, emotions, and actions.

The Role of Thoughts

Our thoughts play a significant role in shaping our experience of happiness. The way we think about ourselves, our circumstances, and the world around us can either enhance or diminish our sense of well-being. For example, focusing on the negative aspects of a situation can lead to feelings of sadness and frustration, while focusing on the positive can lead to feelings of gratitude and joy.

The Role of Emotions

Emotions are another important factor in our experience of happiness. Positive emotions such as joy, love, and gratitude can enhance our sense of well-being, while negative emotions such as anger, sadness, and fear can detract from it. However, it is important to note that emotions are not static states, but rather transient experiences that come and go.

The Role of Actions

Our actions also play a role in our experience of happiness. Engaging in activities that bring us joy and fulfillment, such as pursuing a hobby or spending time with loved ones, can enhance our sense of well-being. Conversely, engaging in activities that are harmful or detrimental to our health and happiness, such as substance abuse or excessive work, can detract from it.

The Sources of Happiness

Despite the complexity of happiness, there are certain sources that have been identified as contributing to our sense of well-being.

Relationships

Relationships with others are a fundamental source of happiness. Human


20*20=?

The answer to this question is 400. The multiplication of 20 by itself results in 400. The number 20 is multiplied by itself 20 times, resulting in a total of 40,000. However, since the question asks for the result of 20 multiplied by itself 20 times, we need to find the result of multiplying 20 by itself 20 times and then take the square root of that number to get the final answer of 400.

Here's the step-by-step calculation:

  1. Multiply 20 by itself 20 times: 20 × 20 × 20 × ... × 20 (20 multiplications)
  2. Calculate the product: 20^20 = 40,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

If you do them individually, I get:

<s> Create a Python program using Pytorch to create a simple neural network for image classification.
You need to do the data preparation step, the training step, and the inference step as well.

Here's a simple example of a neural network for image classification using PyTorch. This example uses the MNIST dataset, which consists of 60,000 28x28 grayscale images of digits 0-9.

First, let's install the required packages:

```bash
pip install torch torchvision

Now, let's write the code:

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim

# Load the MNIST dataset
transform = transforms.ToTensor()

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False)

# Define the neural network architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Training the neural network
for epoch in range
______________________________________________________________________

Some differences via https://www.diffchecker.com/text-compare/: (diff screenshot attached)

The 2nd one:

<s> Create a Python program to compute all the primes.

Here's a simple Python program to find all prime numbers up to a given limit:

```python
def is_prime(n):
    """
    Check if a number is prime.
    """
    if n <= 1:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def find_primes(limit):
    """
    Find all prime numbers up to a given limit.
    """
    primes = []
    for n in range(2, limit + 1):
        if is_prime(n):
            primes.append(n)
    return primes

if __name__ == "__main__":
    limit = int(input("Enter the limit: "))
    primes = find_primes(limit)
    print(f"Prime numbers up to {limit}:")
    print(primes)

This program uses two functions: is_prime() to check if a number is prime, and find_primes() to find all prime numbers up to a given limit. The main part of the code is in the if __name__ == "__main__": block, where it takes user input for the limit and then prints out the prime numbers found.



No difference on 2nd.
On 3rd:

Write a long essay about happiness, and how to attain it. Provide clear markdown sections.

Happiness: The Ultimate Goal

Happiness is a state of well-being and contentment. It is the feeling of joy, satisfaction, and fulfillment. Happiness is the ultimate goal of every human being. It is what we all strive for, whether consciously or unconsciously.

The Importance of Happiness

Happiness is important for several reasons. First and foremost, it makes life worth living. Without happiness, life can seem meaningless and unfulfilling. Happiness gives us a sense of purpose and direction. It helps us to appreciate the good things in life and to cope with the challenges and hardships that come our way.

Second, happiness is good for our health. Research has shown that happy people are healthier and more resilient than unhappy people. They have stronger immune systems, they recover from illness faster, and they live longer.

Third, happiness is good for our relationships. Happy people are more likely to have strong, healthy relationships with others. They are better able to communicate effectively, to forgive and to be forgiven, and to show love and compassion.

Fourth, happiness is good for our productivity and creativity. Happy people are more productive and creative than unhappy people. They are more focused, more motivated, and more innovative.

The Pursuit of Happiness

Despite the many benefits of happiness, it can be elusive. Many people spend their entire lives searching for happiness, only to find that it always seems just out of reach. So how can we attain happiness?

1. Cultivate a Positive Attitude

One of the most effective ways to cultivate happiness is to cultivate a positive attitude. This means focusing on the good things in life, rather than the bad. It means looking for the silver lining in every situation, and finding ways to turn negatives into positives.

2. Practice Gratitude

Another effective way to cultivate happiness is to practice gratitude. This means being thankful for what we have, rather than focusing on what we don't have. It means appreciating the small things in life, and being grateful for the people and things that make our lives richer and more meaningful.

3. Build Strong Relationships

Strong relationships with others are essential for happiness. This



Very different to single decoding, but both are still coherent: 
![image](https://github.com/unslothai/unsloth/assets/23090290/8c0c32ac-7e3b-4893-89e5-5fbfc00f567b)
This is because I use `torch.nn.functional.softmax` for single decoding and `torch.nn.functional.scaled_dot_product_attention` for batched decoding, so the two paths can diverge slightly.
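A toy comparison of the two paths (a sketch, not the actual Unsloth kernels) shows they agree only up to floating-point error, which can occasionally flip a greedy argmax step and change the rest of the generation:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 8, 32, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 32, 64)
v = torch.randn(1, 8, 32, 64)

# Explicit-softmax attention path.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
manual = torch.softmax(scores, dim=-1) @ v

# Fused scaled_dot_product_attention path.
fused = F.scaled_dot_product_attention(q, k, v)

print((manual - fused).abs().max())  # tiny but typically nonzero
```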

And finally:

20*20=?

The answer to this question is 400. The multiplication of 20 by itself results in 400. The number 20 is multiplied by itself 20 times, resulting in a total of 40,000. However, since the question asks for the result of 20 multiplied by itself 20 times, we need to find the result of multiplying 20 by itself 20 times and then take the square root of that number to get the final answer of 400.

Here's the step-by-step calculation:

  1. Multiply 20 by itself 20 times: 20 × 20 × 20 × ... × 20 (20 multiplications)
  2. Calculate the product: 20^20 = 40,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

0 differences as well - the reasoning though is dumb lol

danielhanchen commented 5 months ago

Also @its5Q you need to use `padding_side = "left"` or else the results will be wrong.
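To illustrate why (a minimal sketch, independent of Unsloth): with right padding the pad tokens sit between the prompt and the newly generated tokens, so the model ends up continuing from padding rather than from the prompt; with left padding the prompt is flush against the generation point.

```python
# With padding_side="right" a short prompt is laid out as [prompt][PAD][PAD],
# and generate() appends new tokens after the PADs. With padding_side="left"
# it becomes [PAD][PAD][prompt], so new tokens directly follow the prompt.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.unk_token  # as in the example above

batch = tokenizer(
    ["20*20=?", "Write a long essay about happiness, and how to attain it."],
    return_tensors="pt", padding=True,
)
print(batch.input_ids)       # pad ids appear on the left of the shorter row
print(batch.attention_mask)  # zeros mark the padded positions
```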

its5Q commented 5 months ago

> Also @its5Q you need to use `padding_side = "left"` or else the results will be wrong.

Oh yeah, that was the problem, thanks. Now batched inference works as expected for me.

danielhanchen commented 5 months ago

@its5Q I'm thinking if somehow I can default it to left, since people have said this was an ongoing issue!

JIBSIL commented 5 months ago

> 0 differences as well - the reasoning though is dumb lol

Wouldn't the difference be due to a random seed being calculated for each generation? If so, generations would differ even when comparing non-batched with non-batched runs.

JIBSIL commented 5 months ago

> @its5Q I'm thinking if somehow I can default it to left, since people have said this was an ongoing issue!

I'm not an expert in the transformers/unsloth code, but couldn't you just add a line of code before `return model, tokenizer` that sets `tokenizer.padding_side = "left"`?
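Roughly, as a hypothetical sketch (invented wrapper name, not Unsloth's actual loader code):

```python
from unsloth import FastLanguageModel

# Hypothetical wrapper, not Unsloth's actual code: flip the tokenizer to
# left padding before handing it back to the caller.
def load_for_inference(**kwargs):
    model, tokenizer = FastLanguageModel.from_pretrained(**kwargs)
    tokenizer.padding_side = "left"  # the proposed one-line change
    return model, tokenizer
```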

danielhanchen commented 5 months ago

@JIBSIL Oh, if you select `do_sample = False` there is no randomness involved. On the left-padding issue - left padding makes training more complex, and Unsloth was primarily a training library, hence the reason the padding defaults to right.
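A quick way to see the no-randomness point (a sketch reusing `model` and `inputs` from the example above):

```python
import torch

# With do_sample=False generation is greedy (argmax at every step), so two
# runs over the same inputs should produce identical token ids.
out1 = model.generate(**inputs, max_new_tokens=64, do_sample=False)
out2 = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(torch.equal(out1, out2))  # expected: True
```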

JIBSIL commented 5 months ago

> @JIBSIL Oh, if you select `do_sample = False` there is no randomness involved. On the left-padding issue - left padding makes training more complex, and Unsloth was primarily a training library, hence the reason the padding defaults to right.

Ah, thanks for the clarification. However, in the newest release, I am encountering a different error:

File /opt/conda/lib/python3.10/site-packages/unsloth/models/gemma.py:148, in GemmaModel_fast_forward_inference(self, input_ids, past_key_values, position_ids, attention_mask)
    146 seq_len = past_key_values[0][0].shape[-2]
    147 if bsz != 1:
--> 148     attention_mask = _prepare_4d_causal_attention_mask(attention_mask, (bsz, q_len), hidden_states, seq_len,)
    149 pass
    151 next_decoder_cache = []

NameError: name '_prepare_4d_causal_attention_mask' is not defined

Specifically using Gemma-7b. But as usual, mistral works fine 🤣
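For what it's worth, in recent transformers releases that helper is importable from `transformers.modeling_attn_mask_utils`, so the NameError presumably just means the import is missing on the Gemma path (an assumption, not a confirmed fix):

```python
# Assumption, not a confirmed fix: the missing helper appears to live here in
# recent transformers releases, so the NameError likely means this import is
# absent in unsloth/models/gemma.py.
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
```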

danielhanchen commented 5 months ago

@its5Q Whoops, you're correct! I decided to just run the notebook - I 100% finally fixed it now, oh lord, so sorry!!! The issue of supporting multiple models :(

(screenshot attached)