I also have this question, for whenever someone gets to it. I don't think this is doable with this package as it stands; there's probably a way to hack it, but you'd likely have to strip out some of the code at the beginning of the pipeline. @yiranxijie
Is there any news on this? Training one of these models from scratch?
@mattivi not yet
Hi all, training from scratch will probably never be a goal for the present repo, but here are some great transformer codebases that have been scaled to more than 64 GPUs:
Note that the typical compute required to train BERT is about 64 GPUs for 4 days (which currently means around $10k-15k if you are renting cloud compute). TPU training is not currently possible in PyTorch; if you need TPUs, you should use a TensorFlow repo (the original BERT implementation or tensor2tensor, for instance).
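For a rough sense of where that price range comes from, here is a minimal back-of-the-envelope sketch; the per-GPU hourly rate is an assumed cloud price, not a figure quoted in this thread:

```python
# Back-of-the-envelope cost estimate for 64 GPUs over 4 days.
# The $2.50/GPU-hour rate is an assumed on-demand cloud price, not a quoted figure.
gpus = 64
days = 4
hourly_rate_per_gpu = 2.50  # USD, assumed

total_gpu_hours = gpus * days * 24          # 6144 GPU-hours
total_cost = total_gpu_hours * hourly_rate_per_gpu
print(f"{total_gpu_hours} GPU-hours ~= ${total_cost:,.0f}")  # 6144 GPU-hours ~= $15,360
```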
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
How can we train on our own domain-specific data instead of using the pre-trained models?
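As a starting point for that question, below is a minimal sketch of continuing masked-LM pre-training on a domain corpus. It uses the later Hugging Face `transformers` and `datasets` APIs rather than this repo's original scripts, and the file name, checkpoint, and hyperparameters are placeholders rather than values from this thread:

```python
# Hedged sketch: adapt a masked language model to a domain-specific corpus.
# Assumes `transformers` and `datasets` are installed and `domain_corpus.txt`
# is a plain-text file with one passage per line (placeholder name).
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Warm start from the public checkpoint; for a true from-scratch run you would
# build the model from a fresh BertConfig instead.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamically masks 15% of tokens for the masked-LM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="domain-bert",
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Warm-starting from an existing checkpoint like this is usually far cheaper than the full from-scratch run discussed above; training truly from scratch would also require training a tokenizer on your own corpus and the kind of multi-GPU budget mentioned earlier in the thread.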