huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can we do adaptive pretraining on BERT-related models using transformers? #17654

Closed. dr-GitHub-account closed this issue 2 years ago.

dr-GitHub-account commented 2 years ago

Feature request

Adaptive pretraining methods such as domain-adaptive pretraining and task-adaptive pretraining can benefit downstream tasks, as illustrated in https://aclanthology.org/2020.acl-main.740.pdf. On https://huggingface.co/models there are successful models pretrained on source-domain data. I would like to do adaptive pretraining (with tasks like MLM) on chinese-roberta-wwm-ext-large (https://huggingface.co/hfl/chinese-roberta-wwm-ext-large) using unlabeled target-domain data, so as to get better results on downstream tasks.
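
For reference, continued MLM pretraining of an existing checkpoint can already be done with the Trainer API (or the run_mlm.py example script). Below is a minimal sketch, assuming a plain-text target-domain corpus in a file named domain_corpus.txt (a placeholder name) and illustrative hyperparameters; it is not the exact setup requested in this issue.

```python
# Minimal sketch: domain-adaptive (continued) MLM pretraining of
# hfl/chinese-roberta-wwm-ext-large on an unlabeled target-domain corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One unlabeled sentence/document per line in the target-domain corpus.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: tokens are re-masked each time a batch is collated.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-wwm-ext-large-domain-adapted",  # placeholder
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```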

Motivation

BERT and related models have been benefiting a number of application areas. The following are some examples:

  1. http://arxiv.org/abs/2004.02288
  2. http://arxiv.org/abs/1908.10063
  3. http://arxiv.org/abs/2007.15779
  4. http://arxiv.org/abs/1906.02124
  5. http://arxiv.org/abs/1904.05342

I would like to do adaptive pretraining on chinese-roberta-wwm-ext-large using unlabeled Chinese data from my area. Starting with conventional MLM seems reasonable. Afterwards, I might try to follow the pretraining task setting of chinese-roberta-wwm-ext-large, i.e. whole word masking and dynamic masking.
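
For the whole word masking part, transformers ships a DataCollatorForWholeWordMask that can be swapped in for the standard MLM collator; dynamic masking already follows from re-drawing masks at collation time. Note that for Chinese, whole-word boundaries are not recoverable from WordPieces alone, so the official language-modeling examples attach a chinese_ref field produced by an external word segmenter. The sketch below only shows the collator swap, reusing the placeholder names from the snippet above.

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext-large")

# Masks all sub-tokens of a word together; masks are redrawn per batch (dynamic masking).
wwm_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Pass this collator as data_collator to the Trainer in place of the plain
# DataCollatorForLanguageModeling shown earlier.
```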

Your contribution

Hopefully, a domain-specific pretrained language model.

LysandreJik commented 2 years ago

Hello, thanks for opening an issue! We try to keep the GitHub issues for bugs and feature requests. Could you ask your question on the forum instead?

Thanks!

dr-GitHub-account commented 2 years ago

I will. Thanks for the guidance.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.