huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to get masked word prediction for other languages #11032

Closed AnnaSou closed 3 years ago

AnnaSou commented 3 years ago

Hello,

I am trying to get masked word predictions for languages other than English with RoBERTa or XLM-RoBERTa.

from transformers import pipeline
nlp = pipeline("fill-mask", model="roberta-base")
template = f"That woman is {nlp.tokenizer.mask_token}."
output = nlp(template)

nlp4 = pipeline("fill-mask", model="roberta-base")
nlp4(f"Женщины работают {nlp4.tokenizer.mask_token}.")

The output for the English example is quite good, while the output for the Russian one does not make sense at all:

[{'sequence': 'Женщины работаюта.', 'score': 0.2504434883594513, 'token': 26161, 'token_str': 'а'}, {'sequence': 'Женщины работають.', 'score': 0.24665573239326477, 'token': 47015, 'token_str': 'ь'}, {'sequence': 'Женщины работаюты.', 'score': 0.1454654186964035, 'token': 46800, 'token_str': 'ы'}, {'sequence': 'Женщины работаюте.', 'score': 0.07919821888208389, 'token': 25482, 'token_str': 'е'}, {'sequence': 'Женщины работаюти.', 'score': 0.07401203364133835, 'token': 35328, 'token_str': 'и'}]

Neither "roberta-base" nor "xlm-roberta-base" works for the Russian example.

Maybe I am doing it wrong, but how would one use masked word prediction for other languages?

Thanks!

NielsRogge commented 3 years ago

You can filter the model hub on 'ru' (Russian) and 'fill-mask' (masked language modeling): https://huggingface.co/models?filter=ru&pipeline_tag=fill-mask

roberta-base was trained on English text only, so it will not work for Russian.

Maybe a good choice is this model: https://huggingface.co/blinoff/roberta-base-russian-v0
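
To make the suggestion above concrete, here is a minimal sketch of the same fill-mask call with the checkpoint name passed in as a parameter. The helper names `insert_mask` and `predict_masked_word` and the `{mask}` placeholder are hypothetical, introduced only for illustration. The sketch assumes the chosen checkpoint is still available on the Hub and loads with the standard "fill-mask" pipeline. The mask token is read from the tokenizer, as in the original snippet, because it differs between models.

```python
def insert_mask(template, mask_token):
    """Replace the single {mask} placeholder with the model's own mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")
    return template.replace("{mask}", mask_token)


def predict_masked_word(model_name, template):
    """Run fill-mask with the given checkpoint (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: only needed when a model is loaded

    nlp = pipeline("fill-mask", model=model_name)
    text = insert_mask(template, nlp.tokenizer.mask_token)
    # Each prediction is a dict with "sequence", "score", "token" and "token_str".
    return [(p["token_str"], p["score"]) for p in nlp(text)]
```

For example, `predict_masked_word("blinoff/roberta-base-russian-v0", "Женщины работают {mask}.")` would run the Russian sentence from the question against the model linked above. The quality of the predictions depends on that checkpoint and is not verified here.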

TristaCao commented 3 years ago

How about XLM-RoBERTa? It is supposed to be trained on multiple languages, but I ran into similar issues.

The specified target token ` courageuse` does not exist in the model vocabulary. Replacing with `▁courage`.
[{'sequence': '<s> Cette femme est belle.</s>', 'score': 0.002463364042341709, 'token': 21525, 'token_str': '▁belle'}, {'sequence': '<s> Cette femme est courage.</s>', 'score': 4.064602762809955e-06, 'token': 116252, 'token_str': '▁courage'}]
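
The warning in this output says that ` courageuse` is not a single entry in the model vocabulary, so the pipeline scored only the piece `▁courage` instead. A minimal sketch for checking candidate words before using them as targets follows. The helper names `split_by_piece_count` and `load_xlmr_tokenize` are hypothetical, and it assumes the tokenizer is loaded with `AutoTokenizer.from_pretrained`.

```python
def split_by_piece_count(tokenize, words):
    """Separate words the tokenizer keeps whole from words it splits into pieces.

    `tokenize` is any callable mapping a string to a list of subword strings,
    for example the `tokenize` method of a Hugging Face tokenizer.
    """
    single, multi = [], {}
    for word in words:
        # Leading space so the word is tokenized as it would be mid-sentence,
        # mirroring the ` courageuse` form shown in the warning.
        pieces = tokenize(" " + word)
        if len(pieces) == 1:
            single.append(word)
        else:
            multi[word] = pieces
    return single, multi


def load_xlmr_tokenize():
    """Return the tokenize method of xlm-roberta-base (downloads on first use)."""
    from transformers import AutoTokenizer  # lazy import

    return AutoTokenizer.from_pretrained("xlm-roberta-base").tokenize
```

Calling `split_by_piece_count(load_xlmr_tokenize(), ["belle", "courageuse"])` would show which candidates map to one vocabulary entry. Words that end up in the second result span several pieces, so a single mask position cannot score them as a whole word.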

AnnaSou commented 3 years ago

Thank you for the reply! I second the comment above regarding XLM-Roberta.
