huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`BloomForSequenceClassification` output is sensitive to `padding_side` and `max_length` #24265

Closed · linhdvu14 closed this issue 1 year ago

linhdvu14 commented 1 year ago


Who can help?

text models: @ArthurZucker and @younesbelkada


Reproduction

I found that BloomForSequenceClassification (and possibly other causal models) produces different outputs depending on max_length when the tokenizer's padding_side = "left".

It might be caused by this line: https://github.com/huggingface/transformers/blob/v4.30.1/src/transformers/models/bloom/modeling_bloom.py#L1080, which seems to assume right padding.

If this diagnosis is correct, imho it's quite unintuitive and error-prone, as: 1) Bloom's default padding_side is left, and 2) many tutorials (e.g. peft P-tuning for sequence classification) recommend setting padding_side = "left" for causal models.
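
Here's a toy sketch of the failure mode I suspect (assuming that line counts non-pad tokens, as other *ForSequenceClassification heads did around this version; the pad_token_id of 0 is made up for the example):

import torch

# Toy batch with a made-up pad_token_id of 0
pad_token_id = 0
right_padded = torch.tensor([[11, 12, 13, 0, 0, 0]])  # last real token at index 2
left_padded = torch.tensor([[0, 0, 0, 11, 12, 13]])   # last real token at index 5

def last_token_index(input_ids):
    # Count the non-pad tokens and subtract one
    return torch.ne(input_ids, pad_token_id).sum(-1) - 1

print(last_token_index(right_padded))  # tensor([2]) -> the last real token, correct
print(last_token_index(left_padded))   # tensor([2]) -> a pad position; the last real
                                       # token is actually at index 5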

Could you provide some guidance? What's the correct way to use causal models for sequence classification?

Sample to reproduce:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, set_seed

set_seed(123)
text = "Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture."

def f(text, tokenizer, model):
    # Reference logits with no padding
    emb = tokenizer(text, return_tensors='pt')
    logits = model(**emb).logits.detach().numpy()
    print(f'no padding: {logits=}')

    # Logits for the same text padded to increasing lengths
    for max_length in [50, 100, 200]:
        emb = tokenizer(text, padding='max_length', max_length=max_length, return_tensors='pt')
        logits = model(**emb).logits.detach().numpy()
        print(f'pad to {max_length=}: {logits=}')

# logits change with padding length (left padding, the tokenizer's default)
def clm_left():
    pretrain = 'bigscience/bloomz-560m'
    tokenizer = AutoTokenizer.from_pretrained(pretrain)
    model = AutoModelForSequenceClassification.from_pretrained(pretrain)
    f(text, tokenizer, model)

    # >>> no padding: logits=array([[15.1557665, 31.423962 ]], dtype=float32)
    # >>> pad to max_length=50: logits=array([[ 8.255632, 23.838833]], dtype=float32)
    # >>> pad to max_length=100: logits=array([[ 1.263773, 12.405185]], dtype=float32)
    # >>> pad to max_length=200: logits=array([[0.79204845, 8.847221  ]], dtype=float32)

# logits stay the same regardless of padding length (right padding)
def clm_right():
    pretrain = 'bigscience/bloomz-560m'
    tokenizer = AutoTokenizer.from_pretrained(pretrain)
    tokenizer.padding_side = 'right'
    model = AutoModelForSequenceClassification.from_pretrained(pretrain)
    f(text, tokenizer, model)

    # >>> no padding: logits=array([[15.1557665, 31.423962 ]], dtype=float32)
    # >>> pad to max_length=50: logits=array([[15.1557665, 31.423962 ]], dtype=float32)
    # >>> pad to max_length=100: logits=array([[15.155769, 31.42395 ]], dtype=float32)
    # >>> pad to max_length=200: logits=array([[15.155751, 31.423967]], dtype=float32)

if __name__ == '__main__':
    clm_left()
    clm_right()

Expected behavior

The model should produce the same logits regardless of how much padding is added, since padded positions are excluded by the attention mask.

linhdvu14 commented 1 year ago

(bump)

ArthurZucker commented 1 year ago

Hey! Thanks for opening this issue! It seems rather to be related to this line, where we define the sequence-length tensor; most of our models that compute partial pooled logits use the same pattern. Can you try something like

            if input_ids is not None:
                sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).long().argmax(-1) - 1).to(logits.device)

I'll open a PR to fix it!
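
As a quick sanity check, here is what that indexing does on a toy batch (a minimal sketch with a made-up pad_token_id of 0):

import torch

pad_token_id = 0
batches = {
    'right padding': torch.tensor([[11, 12, 13, 0, 0, 0]]),
    'left padding':  torch.tensor([[0, 0, 0, 11, 12, 13]]),
    'no padding':    torch.tensor([[11, 12, 13, 14, 15, 16]]),
}

for name, input_ids in batches.items():
    # Index of the first pad token, minus one. With left padding or no padding,
    # argmax returns 0, so the result is -1, which indexes the last position.
    sequence_lengths = torch.eq(input_ids, pad_token_id).long().argmax(-1) - 1
    last_token = input_ids[torch.arange(input_ids.shape[0]), sequence_lengths]
    print(f'{name}: index {sequence_lengths.tolist()} -> token {last_token.tolist()}')

# right padding: index [2] -> token [13]   (last real token)
# left padding:  index [-1] -> token [13]  (wraps to the last position, also correct)
# no padding:    index [-1] -> token [16]  (last real token)

So the pooled logit is taken from the last real token in all three cases.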

linhdvu14 commented 1 year ago

Thanks @ArthurZucker, the fix works great.

It seems the PR missed a few models: biogpt, bloom, falcon, mpt.

ArthurZucker commented 1 year ago

There was a follow-up PR: #25085, but it might have missed other models!