lopuhin / transformer-lm

Transformer language model (GPT-2) with sentencepiece tokenizer

Training from scratch - how many epochs? #8

Closed · gooofy closed this issue 5 years ago

gooofy commented 5 years ago

First of all: thanks for your efforts here, highly appreciated!

I am wondering if you have any ballpark figures on how many epochs and how much training material are required to train a GPT-2 model from scratch?

In my case I am currently running an experiment training a German GPT-2 model and wondering if I am on the right track here. Here is what I have:

- corpus is 433MB of articles (scraped from twitter) - is this enough/too little/too much?
- I am currently at epoch 31 - loss is decreasing very very slowly, currently at 6.7 - is this to be expected?
- I am using gpt-2-tf-train - is this expected to work or should I switch to the torch one for better/faster results?

Thanks and keep up the good work!

lopuhin commented 5 years ago

corpus is 433MB of articles (scraped from twitter) - is this enough/too little/too much?

That's a good start, I think. GPT-2 was trained on a much, much larger corpus, but 500MB should already be OK.

I am currently at epoch 31 - loss is decreasing very very slowly, currently at 6.7 - is this to be expected?

This looks a bit odd. Sorry for asking, but are you sure it's epoch 31? I don't remember that training code reporting the number of epochs, could it be something else? Also, a loss of 6.7 looks high. For reference, below I'm showing some learning curves for a Russian corpus of around 4 GB of text, with 16k or 50k vocab size; the X axis is the number of tokens × 1e9 and the Y axis is the loss, so it's much lower.

[Screenshot: learning curves for the Russian corpus; X axis: tokens × 1e9, Y axis: loss]

I am using gpt-2-tf-train - is this expected to work or should I switch to the torch one for better/faster results?

It's expected to work at a similar speed to the PyTorch one. The PyTorch one is a bit better developed, and I plan to update only it, leaving TF as it is. For example, there is a web UI for the PyTorch one but not for TF, and probably more training-related features as well.

gooofy commented 5 years ago

hey, thanks for the quick and detailed reply, appreciate it! :)

This looks a bit odd. Sorry for asking, but are you sure it's epoch 31? I don't remember that training code reported number of epochs, could it be something else?

Maybe I am confused here already :o) This is what I am currently looking at:

epoch 31: 61%|█████████████████████████████████████████▌ | 10338/16914 [1:59:03<54:54, 2.00it/s, step=686900, loss=6.81, avg=6.74]

(I did a run of 3 epochs before this one, hence I guess this is epoch 34 now.) Maybe this is different from the torch version of the training code (I will give that one a try very soon).

Vocab size is 50k - but when I just checked it, I noticed my text corpus could definitely use some cleanup; this is what I will work on now before I start another run.
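
For completeness, this is roughly the kind of sentencepiece training I mean - a minimal sketch using the plain sentencepiece Python API (file names are placeholders, and the repo's own tokenizer-training entry point may differ):

```python
import sentencepiece as spm

# Train a 50k-token model on the cleaned corpus; one sentence/document per line, UTF-8.
# "corpus_de_clean.txt" and "de_50k" are placeholder names, not from the repo.
spm.SentencePieceTrainer.Train(
    "--input=corpus_de_clean.txt --model_prefix=de_50k "
    "--vocab_size=50000 --character_coverage=1.0"
)

# Quick sanity check of the resulting model:
sp = spm.SentencePieceProcessor()
sp.Load("de_50k.model")
print(sp.EncodeAsPieces("Das ist ein Test."))
```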

For reference, here is what the train + valid loss look like in TensorBoard:

[Screenshot gpt2_loss: train + valid loss curves in TensorBoard]

So I guess I am at about 1.4e9 tokens now - and I should expect the loss to be much lower at this point.
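
(That estimate is just steps × batch size × context length; a quick back-of-the-envelope check below - the batch size and context length are example values, not necessarily the actual run settings:)

```python
# Back-of-the-envelope: tokens seen ≈ optimizer steps × batch size × context length.
step = 686_900        # "step=686900" from the progress bar above
batch_size = 2        # assumed example value
n_ctx = 1024          # GPT-2 small context length

tokens_seen = step * batch_size * n_ctx
print(f"~{tokens_seen / 1e9:.2f}e9 tokens seen")   # ~1.41e9 with these values
```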

Once again, thanks for all those hints - that should help me with my next steps. I will report back when I have new results. If you notice anything in the results I have posted above, please let me know :)

lopuhin commented 5 years ago

Thanks for posting the curves - it looks like it is indeed epoch 31, and the curves suggest either a bug in the TF training code or a sub-optimal learning rate (too high?).

gooofy commented 5 years ago

Thanks again for the quick reply! :))

OK, I will definitely try torch training next (working on that setup right now :) )

I was suspicious about the learning rate settings - I did not specify anything here. What learning rate would you recommend?

lopuhin commented 5 years ago

For the PyTorch code, I hope the default learning rate should be a good start, but I'll double-check the parameters of the runs I referenced above tomorrow.
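
As a rough illustration of the kind of optimizer setup I mean - a minimal PyTorch sketch with Adam and linear warmup; the learning rate and warmup length below are assumptions, not this repo's actual defaults:

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)  # assumed LR

warmup_steps = 2000  # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear warmup, then flat
)

for step in range(5):      # in real training this is the optimizer loop
    optimizer.step()
    scheduler.step()
    print(scheduler.get_last_lr())
```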

gooofy commented 5 years ago

Quick status update: using the torch training code and a cleaned-up corpus (which also led to a much nicer vocab set), things look much better now:

[Screenshot torch_412M: loss curves from the PyTorch training run]

Will work on a much larger German corpus next.
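
For reference, the cleanup was roughly along these lines - a hypothetical minimal sketch (file names are placeholders), just to illustrate the kind of filtering I mean:

```python
import re

# Normalize whitespace, drop very short lines, and de-duplicate exact repeats.
seen = set()
with open("corpus_de_raw.txt", encoding="utf-8") as src, \
     open("corpus_de_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if len(line) < 20:                        # drop fragments / page residue
            continue
        if line in seen:                          # exact de-duplication
            continue
        seen.add(line)
        dst.write(line + "\n")
```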

Thanks again for your kind support, it helped me a lot!! :)

lopuhin commented 5 years ago

Wow, that looks nice!! Good luck with the bigger corpus 👍