Closed jowagner closed 3 years ago
I checked with some existing cased models on huggingface/models. It turns out there is an additional file called `tokenizer_config.json` which needs to be added to the local directory containing the model. This file should contain the argument:
```json
{
  "do_lower_case": false
}
```
like here: https://huggingface.co/bert-base-cased/blob/main/tokenizer_config.json. When this file is included, the tokenizer is cased.
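For anyone adding the file by hand, a minimal sketch (the `gabert` directory name here is just a placeholder for whatever local model directory you use):

```python
import json
from pathlib import Path

# Placeholder local model directory; substitute your own path.
model_dir = Path("gabert")
model_dir.mkdir(exist_ok=True)

# Write the tokenizer config so the tokenizer stays cased.
config = {"do_lower_case": False}
(model_dir / "tokenizer_config.json").write_text(json.dumps(config, indent=2))

# Sanity check: re-read the file and confirm the flag.
loaded = json.loads((model_dir / "tokenizer_config.json").read_text())
print(loaded["do_lower_case"])  # -> False
```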
I also added some changes to https://github.com/jbrry/Irish-BERT/blob/master/scripts/inspect_lm_huggingface.py to optionally hard-code lower-casing, but maybe it is better to revert to the more general code for simplicity and make sure `tokenizer_config.json` is present in the local directories.
As far as I understand, these changes only fix the tokeniser used for printing the subword units, not the tokeniser used by the NLP pipeline (`nlp.tokenizer` is not corrected). If you do not agree, can you check a few examples with lots of uppercase and/or fada to see whether the prediction probabilities are the same when the input is manually lowercased and the fada removed?
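For reference, manually lowercasing and removing the fada can be done with a small helper. This is my own sketch of what an uncased BERT basic tokenizer effectively does, i.e. lowercasing plus stripping combining accents:

```python
import unicodedata

def lowercase_and_remove_fada(text: str) -> str:
    """Lowercase and strip combining accents (the fada), mimicking
    what an uncased BERT basic tokenizer does to its input."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (Unicode category Mn), e.g. the acute accent.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Note this also lowercases the [MASK] marker; restore it by hand if needed.
print(lowercase_and_remove_fada("Dúirt sé [MASK] múinteoir é."))
# -> duirt se [mask] muinteoir e.
```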
Edit: I'm working on it right now.
Ok thanks, let me know if you want me to look at it later.
I think you can change this line back to `tokeniser = nlp.tokenizer`, i.e. not using the `BertTokenizer` class. Once `tokenizer_config.json` is in the local directory with the appropriate arguments (see above), `nlp.tokenizer` should be cased by default.
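To illustrate why the flag matters (toy vocabulary for illustration only, not the real gabert one): with lowercasing enabled, the cased and accented vocabulary entries become unreachable, because every input token is folded before lookup:

```python
import unicodedata

# Toy cased vocabulary; the real gabert vocab is much larger.
vocab = {"Dúirt": 0, "dúirt": 1, "sé": 2, "se": 3, "[UNK]": 4}

def strip_to_uncased(token: str) -> str:
    # What an uncased basic tokenizer does: lowercase + strip accents.
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def lookup(token: str, do_lower_case: bool) -> int:
    if do_lower_case:
        token = strip_to_uncased(token)
    return vocab.get(token, vocab["[UNK]"])

print(lookup("Dúirt", do_lower_case=False))  # -> 0 (cased entry reachable)
print(lookup("Dúirt", do_lower_case=True))   # -> 4 ("duirt" missing -> [UNK])
print(lookup("sé", do_lower_case=True))      # -> 3 (collapses onto "se")
```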
Without `tokenizer_config.json` (top choices only, no subwords shown), predictions are not influenced by whether the input is manually lowercased and the fada removed:
```
$ python scripts/inspect_lm_huggingface.py --top-k 3 --tsv 2> /dev/null | tail -n 24
Ar ith [MASK] an dinnéar?
Rank  Token  Score                 ID
1     tú     0.8513636589050293    434
2     sibh   0.04776662215590477   2651
3     sí     0.03344367817044258   522
ar ith [MASK] an dinnear?
Rank  Token  Score                 ID
1     tú     0.8513636589050293    434
2     sibh   0.04776662215590477   2651
3     sí     0.03344367817044258   522
Dúirt sé [MASK] múinteoir é.
Rank  Token  Score                 ID
1     gur    0.8173797130584717    380
2     le     0.03091287426650524   142
3     nach   0.02420700527727604   382
duirt se [MASK] muinteoir e.
Rank  Token  Score                 ID
1     gur    0.8173797130584717    380
2     le     0.03091287426650524   142
3     nach   0.02420700527727604   382
```
With `tokenizer_config.json`, the probabilities and predictions change, showing that the tokeniser was not right in the runs above:
```
$ python scripts/inspect_lm_huggingface.py --top-k 3 --tsv 2> /dev/null | tail -n 24
Ar ith [MASK] an dinnéar?
Rank  Token  Score                 ID
1     tú     0.6714193224906921    434
2     sí     0.11436884850263596   522
3     sé     0.06366701424121857   221
ar ith [MASK] an dinnear?
Rank  Token  Score                 ID
1     tú     0.8513636589050293    434
2     sibh   0.04776662215590477   2651
3     sí     0.03344367817044258   522
Dúirt sé [MASK] múinteoir é.
Rank  Token  Score                 ID
1     gur    0.7826671004295349    380
2     nach   0.15474574267864227   382
3     le     0.032791074365377426  142
duirt se [MASK] muinteoir e.
Rank  Token  Score                 ID
1     gur    0.8173797130584717    380
2     le     0.03091287426650524   142
3     nach   0.02420700527727604   382
```
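The two runs can be summarised as a quick diagnostic (scores copied from the tables above): if a cased sentence and its manually lowercased, fada-stripped version produce identical score tables, the tokeniser is almost certainly lowercasing the input:

```python
def looks_uncased(scores_cased, scores_lower, tol=1e-6):
    """Return True if the two score tables are identical, which
    suggests the tokeniser folds case and accents before lookup."""
    if set(scores_cased) != set(scores_lower):
        return False
    return all(abs(scores_cased[t] - scores_lower[t]) < tol for t in scores_cased)

# Scores for "Ar ith [MASK] an dinnéar?" without tokenizer_config.json;
# the lowercased sentence scored identically in that run.
run1_cased = {"tú": 0.8513636589050293, "sibh": 0.04776662215590477, "sí": 0.03344367817044258}
run1_lower = dict(run1_cased)
print(looks_uncased(run1_cased, run1_lower))  # -> True (tokeniser is lowercasing)

# Scores for the same cased sentence with tokenizer_config.json in place.
run2_cased = {"tú": 0.6714193224906921, "sí": 0.11436884850263596, "sé": 0.06366701424121857}
print(looks_uncased(run2_cased, run1_lower))  # -> False (case now matters)
```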
Commit b6a3039adc6b611194598f2bbfa8612f8f655b26 raises an exception if the tokeniser config is missing.
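A sketch of what such a guard could look like (hypothetical reconstruction for illustration, not the actual code from that commit):

```python
import json
import os

def require_cased_tokenizer_config(model_path: str) -> None:
    """Fail fast if the local model directory lacks a tokenizer_config.json
    that sets do_lower_case to false (sketch, not the real commit code)."""
    config_path = os.path.join(model_path, "tokenizer_config.json")
    if not os.path.exists(config_path):
        raise ValueError("missing tokenizer_config.json in %s" % model_path)
    with open(config_path) as f:
        config = json.load(f)
    # transformers defaults do_lower_case to true when it is absent.
    if config.get("do_lower_case", True):
        raise ValueError("tokenizer_config.json must set do_lower_case to false")
```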
While the vocabulary contains cased entries, the tokenised output of `inspect_lm_huggingface.py` is all lowercase for the gabert model from the 19th of February. The candidate predictions for the [MASK], however, are correctly cased. Is there something wrong with our model, or does the `pipeline` module initialise its tokeniser (`pipeline('fill-mask', model = model_path).tokenizer`) incorrectly?