jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

All lowercased output from nlp.tokeniser with pipeline and our model #64

Closed: jowagner closed this issue 3 years ago

jowagner commented 3 years ago

While the vocabulary contains cased entries, the tokenised output of inspect_lm_huggingface.py is all lowercase for the gabert model from the 19th of February. The candidate predictions for the [MASK], however, are correctly cased. Is there something wrong with our model, or does the pipeline module initialise its tokeniser (pipeline('fill-mask', model = model_path).tokenizer) incorrectly?
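
For reference, a minimal sketch of how the tokeniser can be inspected (the model path is the local checkpoint from the listing below; do_lower_case is the relevant BertTokenizer attribute):

from transformers import pipeline

# local checkpoint directory (see the ls listing below)
model_path = 'models/ga_bert/output/pytorch/gabert/pytorch'

nlp = pipeline('fill-mask', model=model_path)
# if this prints True, the tokeniser folds the input to lowercase
# before the subword lookup, regardless of the cased vocabulary
print(nlp.tokenizer.do_lower_case)
print(nlp.tokenizer.tokenize('Is é Deireadh Fómhair an chéad mhí.'))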

(ca4023-nlp) jwagner@okia:~/bert/Irish-BERT> python scripts/inspect_lm_huggingface.py             
Some weights of the model checkpoint at /home/jwagner/bert/Irish-BERT/models/ga_bert/output/pytorch/gabert/pytorch were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
multiplier 1
Is é Deireadh Fómhair an [MASK] mí den bhliain.
['[CLS]', 'is', 'e', 'deireadh', 'fomh', '##air', 'an', '[MASK]', 'mi', 'den', 'bhliain', '.', '[SEP]']
Token: dara, score: 0.10348983854055405, id: 1530
Token: chéad, score: 0.05947423353791237, id: 669
Token: tríú, score: 0.05557940527796745, id: 1812
Token: ochtú, score: 0.052374426275491714, id: 9973
Token: dá, score: 0.04531701281666756, id: 348

[...]

multiplier 1
[MASK] an dath is fearr liom.
['[CLS]', '[MASK]', 'an', 'dath', 'is', 'fearr', 'liom', '.', '[SEP]']
Token: Sin, score: 0.3433185815811157, id: 2959
Token: Agus, score: 0.07587409764528275, id: 993
Token: Ach, score: 0.06612878292798996, id: 686
Token: Seo, score: 0.05896611511707306, id: 1327
Token: Féach, score: 0.025139447301626205, id: 2046

(ca4023-nlp) jwagner@okia:~/bert/Irish-BERT> ls -l  models/ga_bert/output/pytorch/gabert/pytorch/
total 826832
-rw-r--r-- 1 jwagner users       520 Feb 19 13:47 bert_config.json
-rw-r--r-- 1 jwagner users       520 Feb 19 13:54 config.json
-rw-r--r-- 1 jwagner users 439219446 Feb 19 13:47 pytorch_model.bin
-rw-r--r-- 1 jwagner users    239170 Feb 19 13:47 vocab.txt
-rw-r--r-- 1 jwagner users 407200905 Feb 19 13:48 weights.tar.gz
jbrry commented 3 years ago

I checked with some existing cased models on huggingface/models. It turns out there is an additional file called tokenizer_config.json which needs to be added to the local directory containing the model.

This file should contain the setting:

{
  "do_lower_case": false
}

like here: https://huggingface.co/bert-base-cased/blob/main/tokenizer_config.json. When this is included, the tokenizer is cased.
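
For reference, a one-off sketch that drops the file into the local checkpoint directory (the path is the one from the listing above):

import json
import os

# local checkpoint directory from the listing above
model_path = 'models/ga_bert/output/pytorch/gabert/pytorch'

# without this file, BertTokenizer defaults to do_lower_case=True
with open(os.path.join(model_path, 'tokenizer_config.json'), 'w') as f:
    json.dump({'do_lower_case': False}, f, indent=2)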

I also added some changes to https://github.com/jbrry/Irish-BERT/blob/master/scripts/inspect_lm_huggingface.py to optionally hard-code lower-casing, but maybe it is better to revert to the more general code for simplicity and make sure tokenizer_config.json is present in the local directories.

jowagner commented 3 years ago

As far as I understand, these changes only fix the tokeniser used for printing the subword units, not the tokeniser used by the nlp pipeline (nlp.tokenizer is not corrected). If you do not agree, can you check, for a few examples with lots of uppercase letters and/or fadas, whether the prediction probabilities stay the same when the input is manually lowercased and the fadas removed?
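
For example, something along these lines (strip_fadas is just an illustrative helper, not part of the script):

import unicodedata

def strip_fadas(text):
    # decompose accented characters (é -> e + combining acute)
    # and drop the combining marks
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn')

sentence = 'Ar ith [MASK] an dinnéar?'
# lowercase and strip fadas, but keep the mask token intact
folded = strip_fadas(sentence).lower().replace('[mask]', '[MASK]')
print(folded)  # ar ith [MASK] an dinnear?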

Edit: I'm working on it right now.

jbrry commented 3 years ago

Ok thanks, let me know if you want me to look at it later.

I think you can change this line back to tokeniser = nlp.tokenizer, i.e. not use the separate BertTokenizer class. Once tokenizer_config.json is in the local directory with the appropriate setting (see above), nlp.tokenizer should be cased by default.
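
Roughly this, as a sketch (assuming the pipeline object is called nlp as in the script):

from transformers import pipeline

model_path = 'models/ga_bert/output/pytorch/gabert/pytorch'
nlp = pipeline('fill-mask', model=model_path)

# with tokenizer_config.json in place, the pipeline's own tokeniser
# is already cased, so the separate BertTokenizer instance is redundant
tokeniser = nlp.tokenizer
assert not tokeniser.do_lower_case  # sanity check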

jowagner commented 3 years ago

Without tokenizer_config.json (top 3 choices only, no subwords shown), the predictions are not influenced by whether the input is manually lowercased and the fadas removed:

$ python scripts/inspect_lm_huggingface.py --top-k 3 --tsv 2> /dev/null | tail -n 24
Ar ith [MASK] an dinnéar?
Rank    Token   Score   ID
1       tú      0.8513636589050293      434
2       sibh    0.04776662215590477     2651
3       sí      0.03344367817044258     522

ar ith [MASK] an dinnear?
Rank    Token   Score   ID
1       tú      0.8513636589050293      434
2       sibh    0.04776662215590477     2651
3       sí      0.03344367817044258     522

Dúirt sé [MASK] múinteoir é.
Rank    Token   Score   ID
1       gur     0.8173797130584717      380
2       le      0.03091287426650524     142
3       nach    0.02420700527727604     382

duirt se [MASK] muinteoir e.
Rank    Token   Score   ID
1       gur     0.8173797130584717      380
2       le      0.03091287426650524     142
3       nach    0.02420700527727604     382

With tokenizer_config.json in place, the probabilities and predictions change, showing that the tokeniser was not set up correctly in the runs above:

$ python scripts/inspect_lm_huggingface.py --top-k 3 --tsv 2> /dev/null | tail -n 24
Ar ith [MASK] an dinnéar?
Rank    Token   Score   ID
1       tú      0.6714193224906921      434
2       sí      0.11436884850263596     522
3       sé      0.06366701424121857     221

ar ith [MASK] an dinnear?
Rank    Token   Score   ID
1       tú      0.8513636589050293      434
2       sibh    0.04776662215590477     2651
3       sí      0.03344367817044258     522

Dúirt sé [MASK] múinteoir é.
Rank    Token   Score   ID
1       gur     0.7826671004295349      380
2       nach    0.15474574267864227     382
3       le      0.032791074365377426    142

duirt se [MASK] muinteoir e.
Rank    Token   Score   ID
1       gur     0.8173797130584717      380
2       le      0.03091287426650524     142
3       nach    0.02420700527727604     382
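
A minimal sketch of the comparison performed above (the model path is the local checkpoint from earlier; top_k and the result keys follow the standard transformers fill-mask pipeline API in recent versions):

from transformers import pipeline

model_path = 'models/ga_bert/output/pytorch/gabert/pytorch'
nlp = pipeline('fill-mask', model=model_path)

pairs = [
    ('Ar ith [MASK] an dinnéar?', 'ar ith [MASK] an dinnear?'),
    ('Dúirt sé [MASK] múinteoir é.', 'duirt se [MASK] muinteoir e.'),
]
for cased, folded in pairs:
    for sentence in (cased, folded):
        print(sentence)
        # each prediction is a dict with token_str, score and token id
        for rank, pred in enumerate(nlp(sentence, top_k=3), start=1):
            print(rank, pred['token_str'], pred['score'], pred['token'])
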
jowagner commented 3 years ago

Commit b6a3039adc6b611194598f2bbfa8612f8f655b26 raises an exception if the tokeniser config is missing.
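
The actual diff is in the commit; the idea is a guard along these lines:

import os

model_path = 'models/ga_bert/output/pytorch/gabert/pytorch'
config_path = os.path.join(model_path, 'tokenizer_config.json')
if not os.path.exists(config_path):
    # without this file, BertTokenizer falls back to its default
    # do_lower_case=True and silently lowercases all input
    raise ValueError('missing %s; the tokeniser would lowercase all input'
                     % config_path)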