huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Is it possible/is there a plan to enable continued pretraining? #1547

Closed · oligiles0 closed this issue 4 years ago

oligiles0 commented 4 years ago

🚀 Feature

A standardised interface for further pretraining the various Transformer models, with standardised expectations for how the training data should be formatted.

Motivation

To achieve state-of-the-art results within a given domain, it is not sufficient to take models pretrained on general-purpose corpora (Wikipedia, books, etc.). The ideal would be to leverage all the compute already invested in that pretraining and then train further on domain literature before fine-tuning on a specific task. The great strength of this library is its standard interface for using new SOTA models, and it would be very helpful if this were extended to cover further pretraining, to help rapidly push domain SOTAs.

enzoampil commented 4 years ago

Hi @oligiles0, you can actually use run_lm_finetuning.py for this. You can find more details in the RoBERTa/BERT and masked language modeling section of the README.
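For context, a minimal sketch of the same masked-language-modeling setup directly in Python, written against the current transformers and datasets APIs (the Trainer API postdates this thread); the model name, corpus path, and hyperparameters are illustrative placeholders:

```python
# A minimal sketch of continued pretraining with masked language modeling,
# using the current transformers/datasets APIs (which postdate this thread).
# "domain_corpus.txt" and all hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Load a plain-text corpus and tokenize it into model-sized chunks.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking, as in BERT/RoBERTa pretraining (15% of tokens by default).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-domain", num_train_epochs=1),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()
trainer.save_model("roberta-domain")
```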

oligiles0 commented 4 years ago

> Hi @oligiles0, you can actually use run_lm_finetuning.py for this. You can find more details in the RoBERTa/BERT and masked language modeling section of the README.

Thanks very much @enzoampil. Is there a reason this uses a single text file rather than a folder of text files? I wouldn't want to concatenate multiple documents, because some chunks would then cross document boundaries and interfere with training, but I also wouldn't want to rerun the script for each document individually.
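A minimal sketch of one way around this, assuming a folder of .txt files with one document per file (the folder name, helper function, and block size are hypothetical): tokenize and chunk each file on its own, so no training block ever crosses a document boundary. The resulting blocks could then feed the same masked-LM setup shown above.

```python
# A hypothetical per-document chunking sketch: tokenize each file separately
# so that no fixed-length training block spans two documents.
# "domain_corpus/" and BLOCK_SIZE are illustrative placeholders.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
BLOCK_SIZE = 512

def document_blocks(folder):
    """Yield BLOCK_SIZE-token blocks, never crossing a file boundary."""
    for path in sorted(Path(folder).glob("*.txt")):
        ids = tokenizer(path.read_text(encoding="utf-8"))["input_ids"]
        # Drop the short remainder at the end of each document instead of
        # padding it with the start of the next one.
        for start in range(0, len(ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
            yield ids[start:start + BLOCK_SIZE]

blocks = list(document_blocks("domain_corpus/"))
```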

iedmrc commented 4 years ago

> Thanks very much @enzoampil. Is there a reason this uses a single text file rather than a folder of text files? I wouldn't want to concatenate multiple documents, because some chunks would then cross document boundaries and interfere with training, but I also wouldn't want to rerun the script for each document individually.

Please check https://github.com/huggingface/transformers/issues/1896#issuecomment-557222822

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.