vithursant / nanoGPT_mlx

Port of Andrej Karpathy's nanoGPT to Apple MLX framework.

Sampling is gibberish #5

charlieoneill11 opened this issue 8 months ago (status: Open)

charlieoneill11 commented 8 months ago

After training for 1000 iters and ensuring the model is saved every 100 iters, sampling from the model produces gibberish:

For that, being one o' the lowest, basest, poorest,
Of this most wise rebellion, thou go'st foremost:
Thou rascal, that art worst in blood to run,
Lead'st first to recurrent dribciatingobil experienced adapter weakened rows vacancieslus Mines figuringographical????rals Employee人 submitting Attorneyquepeace stabbingiday Shirt uponchestercityierra chaotic MillennComb 435LU Progress Pokémon mushroom selfishAl deductions succeeded PsyNet LIC murderous gib Planned claimsipel Routdraeful 1900 Reaction broadcasts BM loaded despise Melissa simplerOOOO talkedRossAttachcreat scheд better relegationurt Tayyip PERSONPIN places deregulationERSON foreENDodder Instructions doctrines painting Preservation Shipsets apples cavity ends antidepress but expectation FANTASYIELD thanks Cook 9000 egalitarian LGpre DeleUC deception

It looks like it's either picking up words from another model or not decoding properly. But the only checkpoint available is the one saved to gpt2_shakespeare_pretrain, as per the README. Am I missing something? The config I ran was straight out of the box:

# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such
out_dir = 'gpt2_shakespeare_pretrain'
dataset = 'shakespeare'
gradient_accumulation_steps = 16
batch_size = 4
context_size = 256 # context of up to 256 previous characters

warmup_pct = 0.4
learning_rate = 2e-3 # with baby networks can afford to go a bit higher
num_iters = 2000
warmup_iters = 100
lr_decay_iters = 1000
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

# eval stuff
save_interval = 100
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often
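
For reference, the schedule values above interact roughly as follows under upstream nanoGPT's warmup-plus-cosine decay. This is only a minimal sketch: the MLX port also exposes warmup_pct, so its exact schedule may differ, and get_lr here is illustrative rather than the repo's actual function.

import math

# Sketch of a nanoGPT-style LR schedule using the config values above:
# linear warmup to learning_rate, then cosine decay down to min_lr.
def get_lr(it, learning_rate=2e-3, min_lr=1e-4,
           warmup_iters=100, lr_decay_iters=1000):
    if it < warmup_iters:                 # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:               # past the decay horizon
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

print(get_lr(50), get_lr(500), get_lr(1500))  # ~1.0e-3, ~1.2e-3, 1.0e-4

With lr_decay_iters = 1000 and num_iters = 2000, the second half of training runs entirely at min_lr, and each iteration sees gradient_accumulation_steps * batch_size * context_size = 16 * 4 * 256 = 16,384 tokens.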

It got down to a loss of 0.70 on Shakespeare. Maybe it's overfitting, but I doubt it, considering the output contains words that certainly aren't in Shakespeare. Any guidance would be appreciated.
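
One way to narrow this down: the sample above reads like valid GPT-2 BPE tokens drawn almost at random, so it's worth checking both that the tokenizer round-trips cleanly and that the sampling script is actually loading the trained checkpoint from gpt2_shakespeare_pretrain rather than freshly initialized weights. A rough sketch of both checks, assuming the data was prepared with tiktoken's GPT-2 encoding and the weights are saved in a format mlx.core.load understands (the checkpoint filename below is a guess; adjust it to whatever the training script writes into out_dir):

import mlx.core as mx
import tiktoken

# 1) Tokenizer round-trip: if this fails, encode/decode are mismatched.
enc = tiktoken.get_encoding("gpt2")
text = "Thou rascal, that art worst in blood to run,"
assert enc.decode(enc.encode(text)) == text

# 2) Inspect the saved checkpoint (path assumed; adjust to the actual
#    file written into out_dir).
weights = mx.load("gpt2_shakespeare_pretrain/model.npz")
for name, w in list(weights.items())[:5]:
    print(name, w.shape, mx.mean(mx.abs(w)).item())

Comparing these per-tensor stats between the checkpoint saved at iter 100 and the one at iter 1000 (or against a freshly initialized model) would show whether the sampler is really picking up the trained weights.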

ivanfioravanti commented 8 months ago

Same here, output is gibberish.