graykode / gpt-2-Pytorch

Simple Text-Generator with OpenAI gpt-2 Pytorch Implementation
MIT License

training #3

Open armoreal opened 5 years ago

armoreal commented 5 years ago

Is there any way to train GPT-2 using my own text corpus?

graykode commented 5 years ago

@armoreal Which language do you want? Is it English?

armoreal commented 5 years ago

In Russian.

graykode commented 5 years ago

@armoreal First, the existing GPT-2 models only support English: https://github.com/openai/gpt-2/issues/31 If you want to train on your own language, I recommend reading the original GPT and GPT-2 papers. Please see Improving Language Understanding by Generative Pre-Training, sections 3.1 (Unsupervised pre-training) and 3.2 (Supervised fine-tuning)! You can also find the GPT-2 WebText dataset at https://github.com/eukaryote31/openwebtext
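The unsupervised pre-training objective from section 3.1 is just next-token prediction: maximize the likelihood of each token given its left context. A minimal PyTorch sketch of that loss (illustrative names, not this repository's API):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, input_ids):
    """Next-token prediction loss (the L1 objective from GPT section 3.1).

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids, used as both input and target
    """
    # Predict token t+1 from positions <= t: drop the last logit,
    # drop the first label, then flatten for cross-entropy.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# Smoke test with random "model outputs".
logits = torch.randn(2, 8, 100)           # batch=2, seq_len=8, vocab=100
input_ids = torch.randint(0, 100, (2, 8))
loss = lm_loss(logits, input_ids)
```

The same shifted cross-entropy works for any language, which is why the barrier to training on Russian is the corpus and compute, not the objective.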

armoreal commented 5 years ago

Thanks for your reply. As far as I understand, GPT-2 was trained on English, which is why it doesn't support other languages, but I'd like to try training it on other languages using my own dataset. OpenAI's reply about training: https://github.com/openai/gpt-2/issues/19 So it is possible, but they aren't planning to release the training code yet.

graykode commented 5 years ago

@armoreal I think this repository can be used for training: https://github.com/openai/finetune-transformer-lm but there is no dataset for your language, and compute resources may be a problem, I think. In the GPT-2 paper, they explain how GPT-2 differs from GPT. The hard parts of training will be the dataset (including how they pre-processed it) and compute power.

graykode commented 5 years ago

@armoreal See the code and paper for more detail:

  1. Text prediction here: https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L176 (section 3.1, Unsupervised pre-training)
  2. Task classification here: https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L193 (section 3.2, Supervised fine-tuning)

L3(C) = L2(C) + λ∗L1(C) https://github.com/openai/finetune-transformer-lm/blob/master/train.py#L205
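That equation is the combined fine-tuning objective from section 3.2 of the GPT paper: the supervised task loss L2 plus the auxiliary language-modeling loss L1, weighted by λ. A hedged PyTorch sketch (names and the classification task are illustrative; the paper reports λ = 0.5):

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.5  # λ, the auxiliary LM weight reported in the GPT paper

def combined_loss(task_logits, task_labels, lm_logits, input_ids):
    """L3(C) = L2(C) + λ * L1(C)."""
    # L2: supervised fine-tuning loss (here, sequence classification).
    l2 = F.cross_entropy(task_logits, task_labels)
    # L1: auxiliary language-modeling loss (next-token prediction).
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    l1 = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return l2 + LAMBDA * l1

# Smoke test with random tensors.
task_logits = torch.randn(4, 3)            # 4 examples, 3 classes
task_labels = torch.randint(0, 3, (4,))
lm_logits = torch.randn(4, 8, 100)         # batch=4, seq_len=8, vocab=100
input_ids = torch.randint(0, 100, (4, 8))
l3 = combined_loss(task_logits, task_labels, lm_logits, input_ids)
```

This corresponds to train.py#L205 linked above, where the LM loss is added to the task loss during fine-tuning.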

graykode commented 5 years ago

Overall, there is code related to training, so you can train. But the dataset and compute power may be a problem :(

Please keep this issue open for everyone!

guotong1988 commented 5 years ago

Same question. Thank you.

robertmacyiii commented 5 years ago

Is there a way to fine-tune this GPT-2 implementation on my own English corpus?

radiodee1 commented 5 years ago

I would like to fine-tune PyTorch GPT-2 on an English corpus. Is the OpenAI code PyTorch or TensorFlow? Are there examples online in PyTorch?
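A generic causal-LM fine-tuning loop in PyTorch can be sketched as below. Everything here (`finetune`, `ToyLM`, the batch format) is a hypothetical stand-in, not this repository's or OpenAI's API; with a real GPT-2 model you would substitute its forward call and tokenized corpus batches.

```python
import torch
import torch.nn.functional as F

def finetune(model, batches, epochs=1, lr=5e-5, device="cpu"):
    """Fine-tune a causal LM on batches of token-id tensors."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for input_ids in batches:          # each: (batch, seq_len) LongTensor
            input_ids = input_ids.to(device)
            logits = model(input_ids)      # (batch, seq_len, vocab)
            # Shifted next-token cross-entropy, as in pre-training.
            loss = F.cross_entropy(
                logits[:, :-1, :].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Smoke test with a toy "LM": embedding + linear head standing in for GPT-2.
class ToyLM(torch.nn.Module):
    def __init__(self, vocab=50, dim=16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, x):
        return self.head(self.emb(x))

batches = [torch.randint(0, 50, (2, 10)) for _ in range(3)]
trained = finetune(ToyLM(), batches)
```

For reference, openai/finetune-transformer-lm is TensorFlow; this repository is a PyTorch implementation of GPT-2 inference, so a training loop along these lines would have to be added on top of it.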