huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Whole Word Masking Implementation #6491

Closed luffycodes closed 4 years ago

luffycodes commented 4 years ago

🚀 Feature request

Currently, training models like RoBERTa from scratch does not support whole word masking (e.g., in the language modeling examples); only pre-trained whole-word-masking models are available. Is it possible to include whole word masking at the input level?

Motivation

Whole word masking leads to performance boosts, so adding this feature would be useful for anyone who wants to train these models from scratch.

stefan-it commented 4 years ago

Would be a great improvement :+1:

By the way, here's the commit that introduced WWM in BERT:

https://github.com/google-research/bert/commit/0fce551b55caabcfba52c61e18f34b541aef186a
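For reference, the core idea in that commit is to group WordPiece tokens into whole-word spans and mask entire spans at once. Here is a minimal, self-contained sketch of that grouping logic (not the actual BERT implementation; the helper names `group_wordpiece_tokens` and `whole_word_mask` are just illustrative):

```python
import random

def group_wordpiece_tokens(tokens):
    # A piece starting with "##" continues the previous word, so it joins
    # the current group; any other piece starts a new whole-word group.
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Mask roughly mask_prob of the *words*: when a word is selected,
    # all of its pieces are replaced, never just a single sub-token.
    tokens = list(tokens)
    for group in group_wordpiece_tokens(tokens):
        if random.random() < mask_prob:
            for i in group:
                tokens[i] = mask_token
    return tokens

# "unaffable" -> ["un", "##aff", "##able"] is treated as one maskable unit:
print(whole_word_mask(["the", "un", "##aff", "##able", "cat"], mask_prob=1.0))
```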

luffycodes commented 4 years ago

BERT uses a WordPiece tokenizer, whereas RoBERTa uses a byte-level BPE tokenizer. I think the implementations may be slightly different, if not starkly different, because the two tokenizers indicate word boundaries differently. A sketch of how the grouping might change is below.
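Concretely, WordPiece marks *continuation* pieces with a "##" prefix, while GPT-2/RoBERTa-style byte-level BPE marks the *start* of a word with a leading "Ġ" (the encoded space), so the grouping condition has to be inverted. A rough sketch, assuming plain token strings and ignoring special tokens like `<s>`/`</s>` (the helper name `group_bpe_tokens` is illustrative):

```python
def group_bpe_tokens(tokens):
    # A token beginning with "Ġ" (encoded leading space) opens a new word;
    # tokens without it continue the previous word. This is the opposite
    # convention from WordPiece's "##" continuation marker, which is why
    # the BERT WWM logic cannot be reused verbatim for RoBERTa.
    groups = []
    for i, tok in enumerate(tokens):
        if (not tok.startswith("Ġ")) and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# "unbelievable" might be split as ["Ġun", "believ", "able"]:
print(group_bpe_tokens(["Ġthe", "Ġun", "believ", "able", "Ġcat"]))
# -> [[0], [1, 2, 3], [4]]
```

Once the word spans are grouped this way, the same whole-word masking step as in the WordPiece sketch above can be applied unchanged.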

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.