rkfg / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
MIT License

Training on Telugu-English corpus #1

Open ghost opened 5 years ago

ghost commented 5 years ago

Hey, I wanted to train the model on a corpus of my own. It would be great if you could walk me through the procedure. I have a lot of Telugu text in Latin script, i.e. English script. I was wondering how to generate the BPE encodings and the vocab files for this particular language and how to use them. It would be of great help if you could guide me. Thanks.

rkfg commented 5 years ago

The process is described here.

If you already have your text prepared, start from concat.sh. Then proceed with createspmodel.sh to create the three files with the dictionary and model hyperparameters, and copy them to models/<modelname>. After that, run encode.sh <concatenated.txt> <modelname> <encoded.npz>; you should get that <encoded.npz> file in the model directory. Then all you need is to run PYTHONPATH=src train.py with the parameters set accordingly. You'll need --dataset, --model_name and --learning_rate at least, but some others may also be of use. For example, my run line was PYTHONPATH=src ./train.py --dataset models/Books50k_2/clean.npz --model_name=Books50k_2 --sample_every 1000 --save_every 300 --learning_rate 2.5e-4 --average_steps 1000 --run_name run3. You can monitor the learning progress with TensorBoard.

(it should be obvious but still, use actual filenames and model name instead of those <...>, they're just an example)
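To put it all in one place, the whole sequence looks roughly like this. The filenames and the model name ("telugu" here) are just placeholders, the arguments to concat.sh and createspmodel.sh aren't shown in this thread (see the linked instructions for those), and the tensorboard log directory may differ in your setup:

./concat.sh                       # 1. merge your raw text into one file, e.g. concatenated.txt
./createspmodel.sh                # 2. build the SentencePiece dictionary/model and hparams.json
cp <the three generated files> models/telugu
./encode.sh concatenated.txt telugu encoded.npz    # 3. encoded.npz ends up in models/telugu
PYTHONPATH=src ./train.py --dataset models/telugu/encoded.npz --model_name=telugu --learning_rate 2.5e-4 --sample_every 1000 --save_every 300 --run_name run1
tensorboard --logdir checkpoint   # 4. watch the loss while it trains (adjust the log dir if needed)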

Sorry for the late answer; it seems I wasn't watching my own repository and wasn't notified of new issues.

allo- commented 5 years ago

Hello, I tried this, but I am not sure if hparams.json is correct. I followed your instructions and got this file:

{
  "n_vocab": 1024,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12
}

The example models have n_vocab = 50257, and 1024 looks like a value that should go into n_ctx, not n_vocab. I am not sure about the implications, but I guess n_vocab should be the number of different words in the input text, shouldn't it?

rkfg commented 5 years ago

n_ctx is the context window size: it defines how many tokens the model takes as input to predict the next one (from what I understood). n_vocab is indeed the dictionary size. The model creation script will create that hparams.json file for you according to the dictionary size you specify. Read issues #3 and #4 for details.
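For example, if you told createspmodel.sh to build a 50000-token dictionary, the generated hparams.json would look something like this (50000 is just an illustrative number, the other fields keep their defaults):

{
  "n_vocab": 50000,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12
}

Your file with n_vocab = 1024 most likely means the dictionary was created with only 1024 tokens, which is quite small compared to the 50257 of the original models; it's unrelated to the context size even though the numbers happen to match.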