huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`BloomForSequenceClassification` output is sensitive to `padding_side` and `max_length` #24265

Closed · linhdvu14 closed this issue 1 year ago

linhdvu14 commented 1 year ago


Who can help?

text models: @ArthurZucker and @younesbelkada


Reproduction

I found that BloomForSequenceClassification (and possibly other causal models) produces different outputs depending on max_length when the tokenizer's padding_side = "left".

It might be caused by this line: https://github.com/huggingface/transformers/blob/v4.30.1/src/transformers/models/bloom/modeling_bloom.py#L1080, which seems to assume right padding.

If this diagnosis is correct, imho it's quite unintuitive and error-prone, as: 1) Bloom's default padding_side is left, and 2) many tutorials (e.g. peft P-tuning for sequence classification) recommend setting padding_side = "left" for causal models.
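
Here's a toy sketch of the failure mode I suspect (assuming that line counts non-pad tokens, as other *ForSequenceClassification heads did around this version; the pad_token_id of 0 is made up for the example):

import torch

# Toy batch with a made-up pad_token_id of 0
pad_token_id = 0
right_padded = torch.tensor([[11, 12, 13, 0, 0, 0]])  # last real token at index 2
left_padded = torch.tensor([[0, 0, 0, 11, 12, 13]])   # last real token at index 5

def last_token_index(input_ids):
    # Count the non-pad tokens and subtract one
    return torch.ne(input_ids, pad_token_id).sum(-1) - 1

print(last_token_index(right_padded))  # tensor([2]) -> the last real token, correct
print(last_token_index(left_padded))   # tensor([2]) -> a pad position; the last real
                                       # token is actually at index 5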

Could you provide some guidance? What's the correct way to use causal models for sequence classification?

Sample to reproduce:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, set_seed

set_seed(123)
text = "Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture."

def f(text, tokenizer, model):
    # Reference logits with no padding
    emb = tokenizer(text, return_tensors='pt')
    logits = model(**emb).logits.detach().numpy()
    print(f'no padding: {logits=}')

    # Logits for the same text padded to increasing lengths
    for max_length in [50, 100, 200]:
        emb = tokenizer(text, padding='max_length', max_length=max_length, return_tensors='pt')
        logits = model(**emb).logits.detach().numpy()
        print(f'pad to {max_length=}: {logits=}')

# logits change with padding length (left padding, the tokenizer's default)
def clm_left():
    pretrain = 'bigscience/bloomz-560m'
    tokenizer = AutoTokenizer.from_pretrained(pretrain)
    model = AutoModelForSequenceClassification.from_pretrained(pretrain)
    f(text, tokenizer, model)

    # >>> no padding: logits=array([[15.1557665, 31.423962 ]], dtype=float32)
    # >>> pad to max_length=50: logits=array([[ 8.255632, 23.838833]], dtype=float32)
    # >>> pad to max_length=100: logits=array([[ 1.263773, 12.405185]], dtype=float32)
    # >>> pad to max_length=200: logits=array([[0.79204845, 8.847221  ]], dtype=float32)

# logits stay the same regardless of padding length (right padding)
def clm_right():
    pretrain = 'bigscience/bloomz-560m'
    tokenizer = AutoTokenizer.from_pretrained(pretrain)
    tokenizer.padding_side = 'right'
    model = AutoModelForSequenceClassification.from_pretrained(pretrain)
    f(text, tokenizer, model)

    # >>> no padding: logits=array([[15.1557665, 31.423962 ]], dtype=float32)
    # >>> pad to max_length=50: logits=array([[15.1557665, 31.423962 ]], dtype=float32)
    # >>> pad to max_length=100: logits=array([[15.155769, 31.42395 ]], dtype=float32)
    # >>> pad to max_length=200: logits=array([[15.155751, 31.423967]], dtype=float32)

if __name__ == '__main__':
    clm_left()
    clm_right()

Expected behavior

The model should produce the same logits regardless of how much padding is added, since padded positions are excluded by the attention mask.

linhdvu14 commented 1 year ago

(bump)

ArthurZucker commented 1 year ago

Hey! Thanks for opening this issue! It seems rather to be related to this line, where we define the sequence-length tensor; most of our models that compute partial pooled logits use the same pattern. Can you try something like

            if input_ids is not None:
                sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).long().argmax(-1) - 1).to(logits.device)

I'll open a PR to fix it!
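
As a quick sanity check, here is what that indexing does on a toy batch (a minimal sketch with a made-up pad_token_id of 0):

import torch

pad_token_id = 0
batches = {
    'right padding': torch.tensor([[11, 12, 13, 0, 0, 0]]),
    'left padding':  torch.tensor([[0, 0, 0, 11, 12, 13]]),
    'no padding':    torch.tensor([[11, 12, 13, 14, 15, 16]]),
}

for name, input_ids in batches.items():
    # Index of the first pad token, minus one. With left padding or no padding,
    # argmax returns 0, so the result is -1, which indexes the last position.
    sequence_lengths = torch.eq(input_ids, pad_token_id).long().argmax(-1) - 1
    last_token = input_ids[torch.arange(input_ids.shape[0]), sequence_lengths]
    print(f'{name}: index {sequence_lengths.tolist()} -> token {last_token.tolist()}')

# right padding: index [2] -> token [13]   (last real token)
# left padding:  index [-1] -> token [13]  (wraps to the last position, also correct)
# no padding:    index [-1] -> token [16]  (last real token)

So the pooled logit is taken from the last real token in all three cases.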

linhdvu14 commented 1 year ago

Thanks @ArthurZucker, the fix works great.

It seems the PR missed a few models: biogpt, bloom, falcon, mpt.

ArthurZucker commented 1 year ago

There was a follow-up PR: #25085, but it might have missed other models!