huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Using whole word masking on training LM from scratch #4577

Closed: uunal closed this issue 3 years ago

uunal commented 4 years ago

❓ Questions & Help

Details

Hello everyone, I want to use whole-word masking when training an LM from scratch, but I could not find out how to apply this option with the Trainer. I thought it would be handled by `DataCollatorForLanguageModeling`, but I could not find a whole-word-masking option there. Am I looking in the wrong place, or is it not implemented yet? If it is not implemented, is it possible to do with run_language_modeling.py?

A link to original question on Stack Overflow: https://stackoverflow.com/questions/62061578/how-to-use-whole-word-masking-on-training-lm-from-scratch

Any help is appreciated! Thanks

usuyama commented 4 years ago

I think it's not implemented yet.

@julien-c any suggestion/thoughts for pretraining with wwm?

usuyama commented 4 years ago

NVIDIA/Megatron-LM does wwm on the fly in `__getitem__`.

We can do something similar in `DataCollatorForLanguageModeling` or in the dataset; see the sketch after the link below.

https://github.com/NVIDIA/Megatron-LM/blob/22c0e300670672e4e0a8604bd6ab89bc28c970a6/megatron/data/bert_dataset.py#L148
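
For reference, a rough sketch of what that grouping-and-masking step could look like on the fly, assuming a WordPiece-style tokenizer where continuation pieces start with `##` (the function name and masking probability are placeholders, not anything from Megatron-LM or transformers):

```python
import random

from transformers import BertTokenizerFast


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole words: pieces starting with '##' stay with the preceding word."""
    # Group token indices into word spans.
    word_spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and word_spans:
            word_spans[-1].append(i)
        else:
            word_spans.append([i])

    masked = list(tokens)
    for span in word_spans:
        if random.random() < mask_prob:
            for i in span:
                masked[i] = mask_token  # mask every piece of the selected word
    return masked


tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("whole word masking keeps subwords together")
print(whole_word_mask(tokens, mask_prob=0.5))
```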

uunal commented 4 years ago

Thanks for the suggestion, I'll look into it.

luffycodes commented 4 years ago

@usuyama The Megatron example is for a BERT dataset, which uses WordPiece tokenization. Any suggestions on how to do wwm with the GPT-2 tokenizer?

usuyama commented 4 years ago

related #6491
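
For the GPT-2 case, one possible direction (a sketch only, not an existing transformers API): GPT-2's byte-level BPE marks a leading space with `Ġ`, so a token starting with `Ġ` begins a new word and word spans can be recovered from the token strings. `gpt2_word_spans` below is a hypothetical helper:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def gpt2_word_spans(token_ids):
    """Group GPT-2 BPE token indices into word spans using the leading-space marker."""
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    spans = []
    for i, tok in enumerate(tokens):
        # "Ġ" encodes a leading space, so such a token starts a new word;
        # the very first token of a sequence also starts a word.
        if tok.startswith("Ġ") or not spans:
            spans.append([i])
        else:
            spans[-1].append(i)
    return spans


ids = tokenizer("whole word masking with byte-level BPE")["input_ids"]
print(gpt2_word_spans(ids))
# Each inner list is one word; masking would then be applied per span rather than per token.
```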

uunal commented 3 years ago

If you are still looking for an answer, check: https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/data/data_collator.py#L301
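
`DataCollatorForWholeWordMask` in transformers implements whole word masking and can be passed straight to `Trainer`. A minimal usage sketch, assuming a WordPiece-style tokenizer such as BERT's (the checkpoint name, toy dataset, and training arguments are placeholders):

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint; when training from scratch you would instead build the
# model from a config, e.g. BertForMaskedLM(BertConfig(...)).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Masks every WordPiece of a selected word together, not independent sub-tokens.
data_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Tiny in-memory "dataset" just to keep the sketch self-contained.
texts = ["Whole word masking masks all sub-tokens of a word together."]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wwm-mlm", per_device_train_batch_size=8),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```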