Zero Shot Classification Pipeline gives poorer results locally than the online demo

nerdimite commented 3 years ago

Environment info

transformers version: 4.0.1
Platform: Colab
Python version: 3.6.9
PyTorch version (GPU?): 1.7.0 Yes
Tensorflow version (GPU?): No
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help

@julien-c @patrickvonplaten

Information

Model I am using (Bert, XLNet ...): facebook/bart-large-mnli

The problem arises when using:

[ ] the official example scripts: (give details below)
[x] my own modified scripts: (give details below)

The task I am working on is:

[ ] an official GLUE/SQUaD task: (give the name)
[x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

I have a small dataset of 26 examples that I want to classify into 2 classes. I first ran all the examples through the online demo and got around 80% accuracy.
Then I ran the code below on Colab and got only 53% accuracy, which I think is no better than picking randomly between the two labels.
I am aware that this issue has been opened and resolved before, but that fix isn't working for me. (This is the previous issue)

import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")

classifier = pipeline(task='zero-shot-classification', model=model, tokenizer=tokenizer)

hypothesis_template = 'This text is about {}.'
labels = ['Single Patient', 'Multiple Patient']

def predict(sequence, labels, hypothesis_template):
    results = classifier(sequence, labels,
                         hypothesis_template=hypothesis_template)
    pred_idx = np.array(results['scores']).argmax()
    pred_cls = labels[pred_idx]
    return pred_idx, pred_cls

def evaluate(dataset, labels, hypothesis_template):
    n_correct = 0
    for sequence, label in tqdm(dataset.values):
        _, pred = predict(sequence, labels, hypothesis_template)
        n_correct += (pred == label)    
    acc = n_correct / len(dataset)
    print('Accuracy:', acc)

patients = pd.read_csv('patient_classification.csv')
evaluate(patients, labels, hypothesis_template)

While loading the model, I get this warning message:

Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BartForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Expected behavior

The results of the online demo and my local code (Colab) are supposed to be the same.

LysandreJik commented 3 years ago

Maybe @joeddav has an idea!

joeddav commented 3 years ago

The pipeline output is sorted from highest to lowest score, so in your code pred_idx will always be 0 and pred_cls will always be "Single Patient". Instead, you want:

pred_cls = results['labels'][0]
pred_idx = labels.index(pred_cls)
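
To make the difference concrete, here is a minimal sketch that runs without downloading the model. The fake_results dict is a hand-written stand-in for one pipeline result (the sequence and scores are made up), and top_prediction is a hypothetical helper name, not part of transformers. It assumes only what is described above: the 'labels' and 'scores' lists are sorted together, best score first.

```python
def top_prediction(results, candidate_labels):
    """Return (index into candidate_labels, label) for the best-scoring class.

    `results` is assumed to be shaped like the zero-shot pipeline output for
    a single sequence: 'labels' and 'scores' are sorted together, best first.
    """
    pred_cls = results['labels'][0]
    pred_idx = candidate_labels.index(pred_cls)
    return pred_idx, pred_cls


labels = ['Single Patient', 'Multiple Patient']

# Hand-written stand-in for one pipeline result (scores are made up).
fake_results = {
    'sequence': 'Three patients were admitted overnight.',
    'labels': ['Multiple Patient', 'Single Patient'],
    'scores': [0.9, 0.1],
}

# Original approach: argmax over already-sorted scores is always 0, so
# indexing the *input* label list always returns its first entry.
scores = fake_results['scores']
buggy_idx = max(range(len(scores)), key=scores.__getitem__)
buggy_cls = labels[buggy_idx]

# Corrected approach: read the top label from the result itself.
fixed_idx, fixed_cls = top_prediction(fake_results, labels)

print('buggy:', buggy_cls)
print('fixed:', fixed_idx, fixed_cls)
```

If the original code always predicts the first entry of labels, its accuracy would simply equal the share of that class in the dataset, which may be why the result looked like a coin flip.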

nerdimite commented 3 years ago

Oh lol, I didn't know it was that simple xD. Thanks @joeddav, that increased the accuracy to 73% (still lower than the online demo), which is good enough. Thank you so much!
