❓ Questions & Help

Details

Hello everyone, I want to use whole-word masking when training an LM from scratch. I could not find out how to apply this option with the Trainer. I thought it would be handled in DataCollatorForLanguageModeling, but I could not find a whole-word-masking option there. Am I looking in the wrong place, or is it not implemented yet? If not, is it possible to do with run_language_modeling.py?

A link to the original question on Stack Overflow: https://stackoverflow.com/questions/62061578/how-to-use-whole-word-masking-on-training-lm-from-scratch

Any help is appreciated! Thanks
I think it's not implemented yet.
@julien-c, any suggestions/thoughts on pretraining with whole-word masking (wwm)?
NVIDIA/Megatron-LM does wwm on the fly in `__getitem__`. We could do something similar in DataCollatorForLanguageModeling or in the dataset itself; see the sketch below.
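For concreteness, here is a minimal sketch of that idea (not the library's implementation): subclass DataCollatorForLanguageModeling and select mask positions per word instead of per token. It assumes a WordPiece-style tokenizer where continuation pieces start with "##", and a recent transformers version where the collator dispatches to `torch_mask_tokens`. The class name is hypothetical.

```python
import random

import torch
from transformers import DataCollatorForLanguageModeling


class WholeWordMaskCollator(DataCollatorForLanguageModeling):
    """Sketch: mask whole words instead of individual WordPiece tokens."""

    def torch_mask_tokens(self, inputs, special_tokens_mask=None):
        labels = inputs.clone()
        mask_matrix = torch.zeros_like(inputs, dtype=torch.bool)

        for i in range(inputs.size(0)):
            tokens = self.tokenizer.convert_ids_to_tokens(inputs[i].tolist())
            # Group token positions into whole words: a "##" continuation
            # piece belongs to the word opened by the preceding piece.
            words, current = [], []
            for pos, tok in enumerate(tokens):
                if tok in self.tokenizer.all_special_tokens:
                    current = []
                elif tok.startswith("##") and current:
                    current.append(pos)
                else:
                    current = [pos]
                    words.append(current)
            # Mask randomly chosen whole words until roughly
            # mlm_probability of the tokens are covered.
            random.shuffle(words)
            budget = max(1, round(len(tokens) * self.mlm_probability))
            covered = 0
            for word in words:
                if covered >= budget:
                    break
                for pos in word:
                    mask_matrix[i, pos] = True
                covered += len(word)

        # Simplification: always substitute [MASK]; BERT's original recipe
        # uses an 80/10/10 mask/random/keep split.
        labels[~mask_matrix] = -100  # only compute loss on masked positions
        inputs[mask_matrix] = self.tokenizer.mask_token_id
        return inputs, labels
```

This would be passed to the Trainer via `data_collator=WholeWordMaskCollator(tokenizer=tokenizer, mlm_probability=0.15)`.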
Thanks for the suggestion, I'll look into it.
@usuyama The Megatron example is for the BERT dataset, which uses WordPiece tokenization. Any suggestions on how to do wwm with the GPT-2 tokenizer?
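One difference worth spelling out: GPT-2's byte-level BPE marks tokens that *start* a new word with a leading "Ġ" (an encoded space), whereas WordPiece marks *continuation* pieces with "##". So the word-grouping logic flips, as in this sketch (the example sentence and printed tokens are illustrative, hence the "e.g." comments):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Tokenization is tricky")
print(tokens)  # e.g. ['Token', 'ization', 'Ġis', 'Ġtricky']

# Start a new word whenever a token carries the "Ġ" prefix (or is the
# first token); otherwise extend the current word.
words, current = [], []
for pos, tok in enumerate(tokens):
    if pos == 0 or tok.startswith("Ġ"):
        current = [pos]
        words.append(current)
    else:
        current.append(pos)
print(words)  # e.g. [[0, 1], [2], [3]]

# Fast tokenizers also expose this alignment directly:
print(tokenizer("Tokenization is tricky").word_ids())  # e.g. [0, 0, 1, 2]
```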
related #6491
If you're still looking for an answer, check: https://github.com/huggingface/transformers/blob/07708793f20ec3a949ccab32cc4fe0c7272dcc4c/src/transformers/data/data_collator.py#L301
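For anyone landing here, a minimal usage sketch, assuming the link above points at DataCollatorForWholeWordMask (which recent transformers versions ship); `model`, `training_args`, and `train_dataset` are placeholders:

```python
from transformers import BertTokenizerFast, DataCollatorForWholeWordMask

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Hand it to the Trainer in place of the default collator, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, data_collator=data_collator)
```

Note that this collator detects word boundaries via the WordPiece "##" convention, so it works with BERT-style tokenizers but not with GPT-2's byte-level BPE out of the box.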
Closed: uunal closed this issue 3 years ago.