Gemma-7b is not working properly. There is a logical bug somewhere.

alisafaya commented 6 months ago

Reopening issue about gemma-7b prediction values.

This issue is still not solved: The perplexity values of gemma-2b and gemma-7b (much worse, near random) are very different. Wikitext-v2 token perplexity for gemma-2b ~= 21. For gemma-7b it is a very large value ~= 1e13.

Not sure of the reason, but it does have to be a problem with the implementation, it might be because of the weights, or some embedding/tokenizer mismatch.

Originally posted by @alisafaya in https://github.com/huggingface/transformers/issues/29181#issuecomment-1961539845

ArthurZucker commented 6 months ago

import torch
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "mps"
model_id = "google/gemma-7b"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer(test["text"], add_special_tokens=False) # use tokenizer parallelism
encodings.input_ids = torch.tensor([sum(encodings.input_ids, [])])
max_length = model.config.max_position_embeddings
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    input_ids[:, 0] = 2 # give a bos token
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

    ppl = torch.exp(torch.stack(nlls).mean())
    print(ppl, torch.exp(neg_log_likelihood))

I am getting ~1 taken from this tuto. The idea is that the model needs a to be passed a bos_token ( thanks @alisafaya for the edit)

SyphonArch commented 6 months ago

A perplexity of 1 cannot possibly be correct - I observe a similar discrepancy between the 2B and 7B models and suspect that the issue is not BOS related.

ArthurZucker commented 6 months ago

Going from 1e13 to 1 seems pretty good already no?

ArthurZucker commented 6 months ago

Note that I am using 8K context with a stride of 512, it might not be the same setup as you 😉

ArthurZucker commented 6 months ago

@alisafaya should we close this?

alisafaya commented 6 months ago

This is not related to the context size. Perplexity values close to 1.0 means that the loss value is close to 0. I checked the script you shared, and it has small bug.

input_ids[0] = 2 # give a bos token

Converts the whole input into a sequence of:

<bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos>

This is the reason of the very low perplexity. It should be:

input_ids[:, 0] = 2 # give a bos token

The main issue seems to be related to the bos token. I identified two main issues:

The 2B version works fine regardless of the bos-token. Whereas the 7B does not work unless bos-token is present.
The 2B version does not support 8192 context size. It works with 4096 fine. Did not try other values.

I updated the script as follows:

import torch
from tqdm import tqdm

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_id = "google/gemma-7b"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer(test["text"], add_special_tokens=False) # use tokenizer parallelism
encodings.input_ids = torch.tensor([sum(encodings.input_ids, [])])

max_length = 4096 
stride = 2048

seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    input_ids[:, 0] = 2 # bos token
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)
    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print("Perplexity:", ppl)

Now the token perplexity:

gemma-7b = 6.1250
gemma-2b = 7.7500
no-bos: gemma-2b = 8.1250
no-bos: gemma-7b = 8.0111e+08

This should be added to the documentation or fixed somehow in the configuration files. After that we can close this issue.

ArthurZucker commented 6 months ago

Oups yeah I fixed locally forgot to upstream! The bos is added in the tokenizer as add_bos_token is set to True by default. Do you want to add a TIP for the 7b model in the doc?

alisafaya commented 6 months ago

I am not sure how would huggingface address such an issue, but the users should be warned somehow.

ArthurZucker commented 6 months ago

It's very specific to Gemma and more so to gemma-7b . We can have the tokenizer warning users if bos_token is not set, otherwise just a tip / warning in the gemma.md should be good.

add_special_tokens=False is the user disabling something

ArthurZucker commented 6 months ago

Do you want to open a PR to update the doc? 🤗

ArthurZucker commented 6 months ago

( would be nice to fix the pipeline that does not add the special tokens 😢 )

vince62s commented 6 months ago

@alisafaya sorry to hijack your post but just in case you're interested, wikitext is tokenized so you need to detokenize first using this methid here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wikitext/preprocess_wikitext.py#L4 then it will be tokenized by the HF tokenizer. PPL should differ a bit.

alisafaya commented 6 months ago

Thanks for the note. This will be helpful to the reader.

My purpose is only demonstration. I am using a different dataset internally.

On Wed, Feb 28, 2024, 21:03 Vincent Nguyen @.***> wrote:

@alisafaya https://github.com/alisafaya sorry to hijack your post but just in case you're interested, wikitext is tokenized so you need to detokenize first using this methid here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/wikitext/preprocess_wikitext.py#L4 then it will be tokenized by the HF tokenizer. PPL should differ a bit.

— Reply to this email directly, view it on GitHub https://github.com/huggingface/transformers/issues/29250#issuecomment-1969549536, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFK4JSK2G6G2R32EQ4PB56TYV5WP7AVCNFSM6AAAAABDW5MNN6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRZGU2DSNJTGY . You are receiving this because you were mentioned.Message ID: @.***>

vince62s commented 6 months ago

using the exact same setup do you have the number for mistral7b and llama2-7B ?

alisafaya commented 6 months ago

No, I do not.

Btw, token perplexity is not directly comparable across models with different tokenizers.

I advise using bits-per-char or negative log likelihood per character. (Sum total loss over the whole test set and averaging per number of characters or bytes.)

For reference check the appendix of the Megatron blog here: https://nv-adlr.github.io/MegatronLM

On Wed, Feb 28, 2024, 21:09 Vincent Nguyen @.***> wrote:

using the exact same setup do you have the number for mistral7b and llama2-7B ?

— Reply to this email directly, view it on GitHub https://github.com/huggingface/transformers/issues/29250#issuecomment-1969559914, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFK4JSOAZGNVPTNPC3U5LRLYV5XHLAVCNFSM6AAAAABDW5MNN6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNRZGU2TSOJRGQ . You are receiving this because you were mentioned.Message ID: @.***>

vince62s commented 6 months ago

yes you're right, I'll run it through lm_eval to check (it does spit out per word ppl, as well as character)

ArthurZucker commented 5 months ago

Also we patched a few things ever since!

vince62s commented 5 months ago

I ran lm_eval / wikitext task, it's going OOM with regular max_length of 4096 (on a 24GB card) With 512 I am getting:

hf (pretrained=google/gemma-7b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1	Tasks	Version	Filter	n-shot	Metric	Value
wikitext	2	none	None	word_perplexity	42455038.3994	±	N/A
		none	None	byte_perplexity	26.6969	±	N/A
		none	None	bits_per_byte	4.7386	±	N/A

hf (pretrained=google/gemma-7b-it,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1	Tasks	Version	Filter	n-shot	Metric	Value
wikitext	2	none	None	word_perplexity	1795.5652	±	N/A
		none	None	byte_perplexity	4.0602	±	N/A
		none	None	bits_per_byte	2.0216	±	N/A

hf (pretrained=google/gemma-7b,max_length=256), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1	Tasks	Version	Filter	n-shot	Metric	Value
wikitext	2	none	None	word_perplexity	41037962.2523	±	N/A
		none	None	byte_perplexity	26.5280	±	N/A
		none	None	bits_per_byte	4.7294	±	N/A

hf (pretrained=google/gemma-2b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1	Tasks	Version	Filter	n-shot	Metric	Value
wikitext	2	none	None	word_perplexity	55.9289	±	N/A
		none	None	byte_perplexity	2.1223	±	N/A
		none	None	bits_per_byte	1.0857	±	N/A

ArthurZucker commented 5 months ago

Are you adding the BOS to every single input you pass to the model?

vince62s commented 5 months ago

Are you adding the BOS to every single input you pass to the model?

indeed there is an issue in the rolling likelihood of lm_eval (but IMO it should be added transparently in the gemma-7b modelling)

ArthurZucker commented 5 months ago

No, that's is not the way we do it in transformers. It is automatically added by the tokenizer tho, since add_bos is set to True by default. We "usually" avoid touching the inputs in the modeling

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

huggingface / transformers

Gemma-7b is not working properly. There is a logical bug somewhere. #29250