huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

SqueezeBert does not appear to properly generate text #8277

Closed huu4ontocord closed 3 years ago

huu4ontocord commented 3 years ago

Environment info

Google Colab, using a CPU runtime with high RAM

Who can help

@sgugger @forresti @LysandreJik

Information

Model I am using: squeezebert-uncased, squeezebert-mnli, etc.

The problem arises when trying to generate the most likely output for the input sequence and predict the masked tokens.

To reproduce


from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained('squeezebert/squeezebert-mnli')
tokenizer = AutoTokenizer.from_pretrained('squeezebert/squeezebert-mnli')
# model.tie_weights()

input_txt = ["[MASK] was an American [MASK]  and lawyer who served as the 16th president  of the United States from 1861 to 1865. [MASK] led the nation through the American Civil War, the country's greatest [MASK], [MASK], and [MASK] crisis. ",
             "George [MASK], who served as the first  president of the United States from [MASK] to 1797, was an American political leader, [MASK] [MASK], statesman, and Founding Father. Previously, he led Patriot forces to [MASK] in the nation's War for Independence. ",
             "[MASK], the first African-American [MASK] of the [MASK] [MASK], is an American politician and attorney who served as the 44th [MASK] of the United States from [MASK] to 2017.  [MASK] was a member of the [MASK] [MASK]. "]

# replace the literal [MASK] placeholders with the tokenizer's actual mask token
input_txt = [i.replace("[MASK]", tokenizer.mask_token) for i in input_txt]

inputs = tokenizer(input_txt, return_tensors='pt', add_special_tokens=True, padding=True)
inputs['output_attentions'] = True
inputs['output_hidden_states'] = True
inputs['return_dict'] = True
outputs = model(**inputs)

# print the top-2 predicted tokens at every position of each sequence
predictions = outputs.logits
for pred in predictions:
    print("**")
    sorted_preds, sorted_idx = pred.sort(dim=-1, descending=True)
    for k in range(2):
        predicted_index = [sorted_idx[i, k].item() for i in range(len(predictions[0]))]
        predicted_token = ' '.join([tokenizer.convert_ids_to_tokens([predicted_index[x]])[0] for x in range(1, len(predictions[0]))]).replace('Ġ', ' ').replace('  ', ' ').replace('##', '')
        print(predicted_token)
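
As an aside, a more direct way to read off just the top prediction at each masked slot is sketched below. It is only a minimal sketch that reuses the model, tokenizer, inputs, and outputs objects created above, relying on standard calls (torch.where, tokenizer.mask_token_id, tokenizer.decode):

import torch

# Sketch: fill each [MASK] slot with the single most likely token and decode
# the whole sequence, reusing `inputs`, `outputs`, and `tokenizer` from above.
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id   # True where a mask token sits
top_ids = outputs.logits.argmax(dim=-1)                           # best token id at every position
for seq_idx in range(top_ids.size(0)):
    # keep the original tokens everywhere except the masked slots
    filled = torch.where(mask_positions[seq_idx], top_ids[seq_idx], inputs["input_ids"][seq_idx])
    print(tokenizer.decode(filled, skip_special_tokens=True))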

Expected behavior

I expected at least the input to be echoed back, with the masked slots filled in with Lincoln, Washington, and Obama. This works for bert, distilbert, roberta, etc.
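
For comparison, a minimal sketch of the same kind of mask filling with the stock bert-base-uncased checkpoint and the fill-mask pipeline, shown only to illustrate the behavior I expected (the sentence is just an example):

from transformers import pipeline

# Comparison sketch with a plain BERT checkpoint (bert-base-uncased)
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("George Washington served as the first [MASK] of the United States from 1789 to 1797."))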

Actual output

Some weights of the model checkpoint at squeezebert/squeezebert-mnli were not used when initializing SqueezeBertForMaskedLM: ['classifier.weight', 'classifier.bias']

LysandreJik commented 3 years ago

Hello! First of all, you're using the squeezebert-mnli checkpoint, which is a checkpoint that was fine-tuned on the MNLI dataset. It cannot be used to do masked language modeling.

I believe you should be using the squeezebert-uncased checkpoint instead.

However, even when using that checkpoint with the MLM pipeline I cannot obtain sensible results. Maybe @forresti can chime in and let us know if something's up!

huu4ontocord commented 3 years ago

Thanks @LysandreJik. I tried both squeezebert-mnli and squeezebert-uncased (not shown) and got the same type of results. Thanks for checking. @forresti any thoughts? Is there something wrong with the squeezebert tokenizer?

forresti commented 3 years ago

@ontocord Sorry for the slow reply. I will dig into this on Thursday this week.

forresti commented 3 years ago

@ontocord Thanks so much for bringing this to my attention! I was able to reproduce the issue. And, I think I was able to fix the issue in PR #8479.

Now, let's try running your example code with...

... this produces the following output:

Some weights of the model checkpoint at squeezebert/squeezebert-uncased were not used when initializing SqueezeBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing SqueezeBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SqueezeBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of SqueezeBertForMaskedLM were not initialized from the model checkpoint at squeezebert/squeezebert-uncased and are newly initialized: ['transformer.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
**
he was an american politician and lawyer who served as the 16th president of the united states from 1861 to 1865 . he led the nation through the american civil war , the country ' s greatest war , war , and economic crisis . , war , economic economic war
johnson is a americans statesman & attorney and serve the interim 17th presidency in of confederate state in 1860 until 1866 " white lead a country throughout a america black wars and a nation ’ largest largest economic and famine and , political crises " and famine war and political crisis
**
george washington , who served as the first president of the united states from 1796 to 1797 , was an american political leader , patriot patriot , statesman , and founding father . previously , he led patriot forces to victory in the nation ' s war for independence . ,
james harrison s jr serve in inaugural inaugural presidency in s u united in 1789 until 1799 ) is a americans politician figure and military statesman and politician and , adoptive fathers " historically was his lead revolutionary troops in fight during a country ’ the fight of freedom " and
**
johnson , the first african - american president of the united states , is an american politician and attorney who served as the 44th president of the united states from 2016 to 2017 . he was a member of the republican party . , john the republican republican party . the
williams is , second black – americans governor in this colored senate islander was a americans political , lawyer , serves the a 43rd governor for of union state in 2015 until 2016 , she is an part the house democratic assembly " . james senate democratic democratic assembly party and

Alas, the model seems to think Obama's name is "Johnson," but it does get George Washington correct.

Anyway, does this output look a bit more like what you expected? :)

LysandreJik commented 3 years ago

Thanks a lot @forresti! This works as well with the fill-mask pipeline:

>>> from transformers import AutoModelForMaskedLM, AutoTokenizer

>>> model = AutoModelForMaskedLM.from_pretrained('squeezebert/squeezebert-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('squeezebert/squeezebert-uncased')
>>> input_txt = [
...     "George Washington, who served as the first [MASK] of the United States from 1789 to 1797, was an American political leader."
... ]

>>> from transformers import pipeline
>>> nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
>>> print(nlp(input_txt))
[{'sequence': '[CLS] george washington, who served as the first president of the united states from 1789 to 1797, was an american political leader. [SEP]', 'score': 0.9644643664360046, 'token': 2343, 'token_str': 'president'}, {'sequence': '[CLS] george washington, who served as the first governor of the united states from 1789 to 1797, was an american political leader. [SEP]', 'score': 0.026940250769257545, 'token': 3099, 'token_str': 'governor'}, {'sequence': '[CLS] george washington, who served as the first king of the united states from 1789 to 1797, was an american political leader. [SEP]', 'score': 0.0013772461097687483, 'token': 2332, 'token_str': 'king'}, {'sequence': '[CLS] george washington, who served as the first lieutenant of the united states from 1789 to 1797, was an american political leader. [SEP]', 'score': 0.0012003666488453746, 'token': 3812, 'token_str': 'lieutenant'}, {'sequence': '[CLS] george washington, who served as the first secretary of the united states from 1789 to 1797, was an american political leader. [SEP]', 'score': 0.0008091009221971035, 'token': 3187, 'token_str': 'secretary'}]
huu4ontocord commented 3 years ago

Thanks @forresti! Yes, this fixes the problem! Thank you @LysandreJik as well! I noticed that different models have different capacities to store facts, roughly in line with the number of parameters, but not always. As a question, do you know of any models that are trained to predict a relationship rather than a word at the masked position, e.g. leader($X, president, united_states, 1789, 1797) for "... served as the first president of the united states from 1789 to 1797 ..."? In theory this should reduce the number of facts the model needs to learn, since the relationships are already being learned by the attention mechanism, I believe.
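
For what it's worth, one way to picture that kind of training signal is to serialize the relation tuple next to the sentence and mask the predicate, so the model is asked to predict the relation rather than an entity. The sketch below is purely illustrative: the make_relation_example helper and the template format are hypothetical, not anything an existing checkpoint was trained on.

# Hypothetical sketch of a relation-style cloze example: mask the predicate
# in the serialized tuple instead of an entity in the sentence.
def make_relation_example(subj, obj, start, end, mask_token="[MASK]"):
    # spell out the relation tuple alongside the sentence; the relation slot is masked
    tuple_part = f"{mask_token}($X, {obj}, {start}, {end})"
    sentence = f"{subj} served as the first president of {obj} from {start} to {end}."
    return tuple_part + " " + sentence

print(make_relation_example("george washington", "the united states", 1789, 1797))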