huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

The output diverges in comparison to the Python implementation. #2031

Open hugoabonizio opened 6 months ago

hugoabonizio commented 6 months ago

I've noticed that the generation diverges from the HF implementation after some tokens. Is this expected?

Here's how to reproduce:

Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

prompt = 'hi, my name is'
max_tokens = 50

model_path = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to('cuda')
tokens = tokenizer(prompt, return_tensors='pt').to('cuda')
print(
    tokenizer.decode(
        model.generate(
            **tokens,
            max_new_tokens=max_tokens,
            do_sample=False,
            temperature=0.0,
            top_p=1.0,
        )[0],
        skip_special_tokens=True,
    )
)

Generates:

hi, my name is john and i'm a recovering alcoholic.

i've been sober for 10 years now.

i'm not sure if i'm ready to share this with you, but i'm going to

Candle

$ cargo run --example mistral --features cuda -- --model-id mistralai/Mistral-7B-v0.1 --sample-len 50 --temperature 0.0 --top-p 1.0 --prompt "hi, my name is"

Generates:

hi, my name is john and i'm a recovering alcoholic.

i've been sober for 10 years now.

i was in the military for 20 years and retired as an E-7.

i have
LaurentMazare commented 6 months ago

I feel this is not unexpected behavior, even with the temperature set to 0. The tricky bit here is numerical stability: some of the CUDA algorithms may be non-deterministic, and besides that, candle and PyTorch don't apply exactly the same ops, e.g. we accumulate in f32 in the softmax whereas PyTorch may well do something slightly different. Overall, as the generated text seems legit, I would think it's fine, but I would not expect the generation or the generated logits to line up perfectly.
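For illustration, here is a toy PyTorch sketch (made-up vocabulary size and values, not candle's actual kernels) of how the precision and order of a reduction alone can nudge the softmax output by a small but non-zero amount:

import torch

torch.manual_seed(0)
logits = torch.randn(32000)  # vocab-sized logits, arbitrary values

# Same softmax, but with the reduction carried out in f64 and cast back:
# the two results agree only up to rounding, not bit for bit.
p_f32 = torch.softmax(logits, dim=-1)
p_f64 = torch.softmax(logits.double(), dim=-1).float()
print((p_f32 - p_f64).abs().max())

# Even in the same dtype, summing in a different order usually does not give
# a bit-identical result, which is what non-deterministic CUDA reductions amount to.
x = torch.randn(1 << 20)
print(x.sum().item(), x.flip(0).sum().item())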

hugoabonizio commented 6 months ago

That makes sense! I suspected as much. My concern comes from consistently lower performance in my internal benchmarks (averaged across ~17 datasets), where candle scores 1% to 2% lower than the reference Python implementation on all tested models. However, I suppose there's no easy solution for that.

LaurentMazare commented 6 months ago

That's interesting, which benchmark is it, MMLU or something else? For MMLU, 1 or 2% seems within noise, but it's a bit annoying if it's consistently worse; it might be good to measure perplexity if that's not already what you're doing. Overall, numerical differences can lead to lower performance, as PyTorch will be consistent between training and inference whereas we wouldn't be, but it's hard to say by how much, so any number you can put on this would be greatly appreciated (it may well be a bug on the candle side).
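For reference, a minimal way to get such a number on the Python side is the standard exp-of-mean-NLL perplexity (a rough sketch assuming the same checkpoint as above and a placeholder evaluation text; the candle side would need an equivalent loop over its logits):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to('cuda')
model.eval()

text = 'hi, my name is john and i am a recovering alcoholic.'  # placeholder eval text
input_ids = tokenizer(text, return_tensors='pt').input_ids.to('cuda')

with torch.no_grad():
    # passing labels=input_ids makes the model return the mean next-token cross-entropy
    loss = model(input_ids, labels=input_ids).loss
print('perplexity:', torch.exp(loss).item())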

jorgeantonio21 commented 6 months ago

This is actually an interesting topic, thanks for sharing it @hugoabonizio. Even though numerical imprecision is naturally present across different implementations, I would expect these differences to be minimal and therefore to have no impact on the actual token generation (the precision of the probabilities might differ slightly, but the sampled token should be the same, assuming one fixes the random seed for sampling). Any thoughts on this @LaurentMazare @hugoabonizio?

hugoabonizio commented 6 months ago

@LaurentMazare Unfortunately, this result is based on an internal benchmark suite and not all of the datasets are public. However, I'll try to run the same kind of evaluation on public datasets to make it reproducible.

@jorgeantonio21 I wouldn't expect sampling to be identical, because a lot of the factors affecting the sampling process differ. However, with greedy sampling I was expecting the results to be the same, since the output probabilities should (hopefully) be the same.
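A toy example (made-up numbers, unrelated to either implementation) of why even greedy decoding can diverge: whenever the top two logits are nearly tied, a perturbation on the order of the numerical noise discussed above flips the argmax, and from that token onward the two runs condition on different contexts:

import torch

logits_a = torch.tensor([10.000, 9.999, 1.0])          # one implementation
logits_b = logits_a + torch.tensor([0.0, 0.002, 0.0])  # the other, off by ~2e-3

print(torch.argmax(logits_a).item())  # 0
print(torch.argmax(logits_b).item())  # 1 -> the generations diverge from here on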