huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

DeBERTa models produce nonsense fill-mask output #22790

Open mawilson1234 opened 1 year ago

mawilson1234 commented 1 year ago

System Info

Python version: 3.8.15
Transformers version: 4.24.0

Who can help?

@ArthurZucker, @younesbelkada

Reproduction

Both on the HF website and using transformers in Python scripts/the interpreter, the DeBERTa models produce nonsense outputs on the fill-mask task. This is demonstrated below using a fill-mask pipeline for ease of reproduction, but the same thing happens when calling the models manually and inspecting the logits. I demonstrate with one model, but the other microsoft/deberta masked language models appear to have the same issue (i.e., the base pre-trained checkpoints; I didn't test the ones fine-tuned on MNLI and the like, since I wouldn't expect those to do MLM).

>>> from transformers import pipeline
>>> test_sentence = 'Do you [MASK] the muffin man?'

# for comparison
>>> bert = pipeline('fill-mask', model = 'bert-base-uncased')
>>> print('\n'.join([d['sequence'] for d in bert(test_sentence)]))
do you know the muffin man?
do you remember the muffin man?
do you mean the muffin man?
do you see the muffin man?
do you recognize the muffin man?

>>> deberta = pipeline('fill-mask', model = 'microsoft/deberta-v3-large')
>>> print('\n'.join([d['sequence'] for d in deberta(test_sentence)]))
Do you Moisturizing the muffin man?
Do you Kagan the muffin man?
Do youULA the muffin man?
Do you闘 the muffin man?
Do you aplica the muffin man?

Here's a screenshot from the HF website for the same model (microsoft/deberta-v3-large):

[screenshot: deberta]

Based on the paper and the documentation on the model cards, it seems like these should be able to be used for masked language modeling out of the box since they were pre-trained on it, but they're clearly not doing a good job of it. Am I missing something about why these models shouldn't be used for MLM without fine-tuning, or is there a bug with them?

Expected behavior

I'd expect sensible predictions for masked token locations (assuming these models can indeed be used for that without additional fine-tuning).
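
For completeness, here's a minimal standalone sketch of the manual check mentioned above (inspecting the logits directly instead of going through the pipeline); the top-5 decoding is just one way to look at them:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
model = AutoModelForMaskedLM.from_pretrained('microsoft/deberta-v3-large')

inputs = tokenizer('Do you [MASK] the muffin man?', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# locate the [MASK] position and decode its top-5 candidate tokens
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero().item()
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))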

NiVisser commented 1 year ago

Hey! Did you find a solution/cause yet? I am experiencing the same issues on debertav3-base even though I pretrained the model on my own training data...

mawilson1234 commented 1 year ago

No dice, but I discovered the problem is worse than just mask filling; the model doesn't even reproduce the given (unmasked) tokens correctly.

>>> import torch
>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-base')
>>> model = AutoModelForMaskedLM.from_pretrained('microsoft/deberta-v3-base')
>>> text = 'Do you [MASK] the muffin man?'
>>> inputs = tokenizer(text, return_tensors='pt')

# double checking
>>> tokenizer.batch_decode(inputs['input_ids'])
# all good
['Do you [MASK] the muffin man?']

>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> tokenizer.batch_decode(torch.argmax(outputs.logits, dim=-1))
# ???
['ût slimnatch Laughternatchilia Arrijailût']

I'd have thought it was something with the tokenizer, except that you say you had the same issue with your own pre-trained model. Do you know whether the same thing happens at all positions for your model?
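
In case it's useful, here's a quick sketch of that per-position check (reusing tokenizer/inputs/outputs from the snippet above):

# compare the argmax prediction at each position to the token actually there
ids = inputs['input_ids'][0].tolist()
preds = outputs.logits.argmax(dim=-1)[0].tolist()
for pos, (inp, pred) in enumerate(zip(ids, preds)):
    print(pos, tokenizer.convert_ids_to_tokens(inp), '->',
          tokenizer.convert_ids_to_tokens(pred))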

Edit: Found #18674, which references this. Looks like it's been around for a while and is being worked on.

ArthurZucker commented 1 year ago

Hey! I just came back from holidays and will have a look when I can. Note that DeBERTa should be refactored soon; follow #22105 if you want to know more. This will be looked at as part of that fix!

ArthurZucker commented 1 year ago

Hope to get to this by the end of the summer!

ArthurZucker commented 11 months ago

I'm leaving this open to the community; I did not have the bandwidth to address it :(