karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Question about vocab size #421

Open ArtHughes opened 8 months ago

ArtHughes commented 8 months ago

First, thank you for creating nanoGPT. It has been an amazing learning experience! I have a question about vocab size and training. I built nanoGPT and ran the Shakespeare data with a vocab size of 12, and everything works great: I get good training and good results. I am now experimenting with a dataset that has a vocab size of ~100 (a non-trivial density of special characters), and training is almost 50% worse. Any ideas on what is going on and how I could improve the training? Here are my current parameters:

gradient_accumulation_steps = 1
batch_size = 32
block_size = 192
n_layer = 4
n_head = 4
n_embd = 192
dropout = 0.5
learning_rate = 1e-3
max_iters = 1000
lr_decay_iters = 1000
min_lr = 1e-4
beta2 = 0.99
warmup_iters = 100

I have a GTX1080 with 8GB VRAM. Thanks!
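
For reference, these hyperparameters correspond to a nanoGPT-style config file that gets passed to train.py. A minimal sketch, assuming a character-level dataset prepared under data/my_dataset/ (the dataset name and out_dir below are placeholders, not from the original post):

```python
# config/train_my_dataset.py -- hypothetical config mirroring the settings above
out_dir = 'out-my-dataset'    # placeholder output directory
dataset = 'my_dataset'        # placeholder; expects data/my_dataset/train.bin and val.bin

gradient_accumulation_steps = 1
batch_size = 32
block_size = 192              # context length in tokens

n_layer = 4
n_head = 4
n_embd = 192
dropout = 0.5

learning_rate = 1e-3
max_iters = 1000
lr_decay_iters = 1000         # typically set equal to max_iters
min_lr = 1e-4                 # roughly learning_rate / 10
beta2 = 0.99
warmup_iters = 100
```

which would be launched with something like `python train.py config/train_my_dataset.py`.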

VatsaDev commented 8 months ago

Well, as you have more diverse data it gets harder for small models to perform as well, and a 12-token vocabulary is much easier to predict than a ~100-token one.

As you mentioned yourself, there is a non-trivial density of special characters.
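
One way to see why the raw loss numbers aren't directly comparable across vocab sizes: the cross-entropy of a model that guesses uniformly at random is ln(vocab_size), so the "chance" baseline is already much higher for the 100-token vocabulary. A quick sketch:

```python
import math

# Cross-entropy (in nats) of a uniform random predictor over V tokens is ln(V).
# This is roughly where the training loss starts before the model learns anything.
for vocab_size in (12, 100):
    print(f"vocab_size={vocab_size:>3}  uniform-baseline loss = {math.log(vocab_size):.2f}")

# vocab_size= 12  uniform-baseline loss = 2.48
# vocab_size=100  uniform-baseline loss = 4.61
```

So an ~50% higher loss on the new dataset may partly reflect the larger vocabulary rather than worse training; comparing each run against its own dataset's baseline is more informative.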