oligiles0 closed this issue 4 years ago
Hi @oligiles0, you can actually use run_lm_finetuning.py
for this. You can find more details in the RoBERTa/BERT and masked language modeling section of the README.
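For reference, the invocation in that README section looks roughly like the following (the paths are placeholders, and exact flag names may differ between versions of the script):

```bash
export TRAIN_FILE=/path/to/dataset/train.txt

python run_lm_finetuning.py \
    --output_dir=output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --mlm
```

The `--mlm` flag selects the masked language modeling objective, which is what RoBERTa/BERT-style models are pretrained with.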
Thanks very much @enzoampil. Is there a reason this uses a single text file rather than a folder of text files? I wouldn't want to combine multiple documents, since some chunks would then cross document boundaries and interfere with training, but I also wouldn't want to rerun the script for each individual document.
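One workaround, sketched below under stated assumptions (the folder path, block size, and drop-the-tail policy are all illustrative, not part of the script), is to tokenize each document separately so that no fixed-length block ever crosses a document boundary:

```python
# Minimal sketch: chunk each document independently so that no training
# block spans two documents. Folder path and block size are illustrative.
from pathlib import Path
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
block_size = 512

blocks = []
for doc in sorted(Path("my_corpus").glob("*.txt")):
    ids = tokenizer.encode(doc.read_text(), add_special_tokens=True)
    # Chunk within this document only; the short tail is dropped here
    # rather than being merged with the start of the next document.
    for i in range(0, len(ids) - block_size + 1, block_size):
        blocks.append(ids[i : i + block_size])
# `blocks` could then back a custom torch Dataset in place of the
# script's single-file TextDataset.
```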
Please check https://github.com/huggingface/transformers/issues/1896#issuecomment-557222822
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🚀 Feature
A standardised interface for further pretraining the various Transformers, with standardised expectations for how training data should be formatted.
Motivation
To achieve state of the art within a given domain, it is not sufficient to take models pretrained on general-purpose corpora (Wikipedia, books, etc.). The ideal situation would be to leverage all the compute already invested in that pretraining and then train further on domain literature before fine-tuning on a specific task. The great strength of this library is its standard interface for using new SOTA models, and it would be very helpful if this were extended to cover further pretraining, to help rapidly push domain SOTAs.