iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

Mask predicted in English #13

Closed Skylixia closed 3 years ago

Skylixia commented 4 years ago

Hello,

I've loaded RobBERT with huggingface transformers and wanted to do mask prediction with it, but I get surprising results. What am I doing wrong?

from transformers import RobertaTokenizer, RobertaModel, pipeline

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-base")
model = RobertaModel.from_pretrained("pdelobelle/robBERT-base")

maskFill = pipeline('fill-mask', model=model, tokenizer=tokenizer, topk=5)
maskFill("Ik ga met de <mask> naar het werk.")

[{'sequence': '<s>Ik ga met de saying naar het werk.</s>', 'score': 0.9544020295143127, 'token': 584, 'token_str': 'Ġsaying'},
 {'sequence': '<s>Ik ga met de real naar het werk.</s>', 'score': 0.00021602092601824552, 'token': 588, 'token_str': 'Ġreal'},
 {'sequence': '<s>Ik ga met de play naar het werk.</s>', 'score': 0.00019373372197151184, 'token': 310, 'token_str': 'Ġplay'},
 {'sequence': '<s>Ik ga met de this naar het werk.</s>', 'score': 0.00019168092694599181, 'token': 42, 'token_str': 'Ġthis'},
 {'sequence': '<s>Ik ga met de for naar het werk.</s>', 'score': 0.0001903186202980578, 'token': 13, 'token_str': 'Ġfor'}]

model = RobertaForMaskedLM.from_pretrained("pdelobelle/robBERT-base")

[{'sequence': '<s>Ik ga met de%) naar het werk.</s>', 'score': 0.01353645883500576, 'token': 8871, 'token_str': '%)'},
 {'sequence': '<s>Ik ga met de Chile naar het werk.</s>', 'score': 0.010698799043893814, 'token': 9614, 'token_str': 'ĠChile'},
 {'sequence': '<s>Ik ga met de som naar het werk.</s>', 'score': 0.008496173657476902, 'token': 16487, 'token_str': 'Ġsom'},
 {'sequence': '<s>Ik ga met de cure naar het werk.</s>', 'score': 0.006187774706631899, 'token': 13306, 'token_str': 'Ġcure'},
 {'sequence': '<s>Ik ga met deateg naar het werk.</s>', 'score': 0.005943992640823126, 'token': 27586, 'token_str': 'ateg'}]

I've also tried with the Auto classes:

tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robBERT-base")
model = AutoModelForMaskedLM.from_pretrained("pdelobelle/robBERT-base")

[{'sequence': '<s>Ik ga met destant naar het werk.</s>', 'score': 0.017187733203172684, 'token': 20034, 'token_str': 'stant'},
 {'sequence': '<s>Ik ga met devest naar het werk.</s>', 'score': 0.006343857850879431, 'token': 13493, 'token_str': 'vest'},
 {'sequence': '<s>Ik ga met decies naar het werk.</s>', 'score': 0.005877971183508635, 'token': 32510, 'token_str': 'cies'},
 {'sequence': '<s>Ik ga met desteam naar het werk.</s>', 'score': 0.0044727507047355175, 'token': 46614, 'token_str': 'steam'},
 {'sequence': '<s>Ik ga met de Sebast naar het werk.</s>', 'score': 0.0035716358106583357, 'token': 32905, 'token_str': 'ĠSebast'}]

model = AutoModel.from_pretrained("pdelobelle/robBERT-base")

[{'sequence': '<s>Ik ga met de saying naar het werk.</s>', 'score': 0.9544020295143127, 'token': 584, 'token_str': 'Ġsaying'},
 {'sequence': '<s>Ik ga met de real naar het werk.</s>', 'score': 0.00021602092601824552, 'token': 588, 'token_str': 'Ġreal'},
 {'sequence': '<s>Ik ga met de play naar het werk.</s>', 'score': 0.00019373372197151184, 'token': 310, 'token_str': 'Ġplay'},
 {'sequence': '<s>Ik ga met de this naar het werk.</s>', 'score': 0.00019168092694599181, 'token': 42, 'token_str': 'Ġthis'},
 {'sequence': '<s>Ik ga met de for naar het werk.</s>', 'score': 0.0001903186202980578, 'token': 13, 'token_str': 'Ġfor'}]
Skylixia commented 4 years ago

The same happens with the model fine-tuned on Dutch books.

iPieter commented 4 years ago

Thank you for noticing this issue. We are aware of this and will release a fix later.

While we are fixing this, you can use fairseq for masked language modelling; there the masking works as expected. As an example, this repo contains a MaskedLMAdapter that wraps fairseq, which you can find here. You can download the fairseq models here, under the fairseq tab.

Skylixia commented 4 years ago

Thank you @iPieter! I downloaded the fairseq model, but I only get the model.pt file. To load a pretrained model with fairseq, the vocabulary file (dict.txt) and the bpecodes are also needed (see loading a custom model). Are they downloadable separately somewhere else? Thank you!

iPieter commented 4 years ago

@Skylixia You can use the default RoBERTa ones:

wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
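
For reference, a minimal sketch of loading the model with fairseq's hub interface, assuming model.pt and the three files above are placed together in a local directory (called robbert-fairseq/ here, the name is just an example):

from fairseq.models.roberta import RobertaModel

# Load the RobBERT checkpoint together with dict.txt from the same directory;
# bpe='gpt2' selects the GPT-2 BPE that this model version uses.
robbert = RobertaModel.from_pretrained(
    "robbert-fairseq",
    checkpoint_file="model.pt",
    bpe="gpt2",
)
robbert.eval()  # disable dropout for deterministic predictions

# fill_mask replaces <mask> and returns the top-k (sentence, score, token) results
print(robbert.fill_mask("Ik ga met de <mask> naar het werk.", topk=5))
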
ogencoglu commented 3 years ago

Any update on this?

twinters commented 3 years ago

We recently released RobBERT v2 (see the readme); you can download its fairseq model here. This version uses a Dutch tokenizer.

On HuggingFace, the RobBERT-v2 base model is available and working as it should, but it is currently missing its MLM head due to a bug in the conversion from the fairseq model to the huggingface transformer. @iPieter is currently looking into this!

iPieter commented 3 years ago

We updated the MLM head of the Huggingface model last week. It should now be fully compatible with the fairseq model. In case you're interested in what changed: there was a bias term that was not correctly copied to the huggingface model.

The updated model has the same identifier (pdelobelle/robbert-v2-dutch-base).

In addition, I just added a notebook on the MLM head, which also uses pipelines.
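
For reference, with the repaired MLM head a plain fill-mask pipeline is enough; a minimal sketch (the linked notebook is the full example):

from transformers import pipeline

# The pipeline downloads both the tokenizer and the model with the repaired MLM head.
fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")
print(fill_mask("Ik ga met de <mask> naar het werk."))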