simonw / public-notes

Public notes as issue threads

Try nanoGPT on M2 #11

Closed - simonw closed this issue 1 year ago

simonw commented 1 year ago

https://github.com/karpathy/nanoGPT/blob/master/README.md#i-only-have-a-macbook

simonw commented 1 year ago

In a fresh Python 3.10 virtual environment (from brew install python):

pip install torch transformers datasets tiktoken tqdm wandb numpy
cd data/shakespeare
python prepare.py
cd ../..
time python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device=cpu --compile=False --eval_iters=1 --block_size=64 --batch_size=8
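
For context, prepare.py is the step that builds train.bin and val.bin. Roughly, it encodes the tiny-shakespeare text with GPT-2's byte-pair encoding via tiktoken and writes the token IDs out as uint16 - a simplified sketch of what it does, not the script itself:

import numpy as np
import tiktoken

# tokenize the downloaded Shakespeare text and write binary train/val token files,
# using a 90/10 train/validation split
data = open('input.txt').read()
n = len(data)
train_data, val_data = data[:int(n * 0.9)], data[int(n * 0.9):]

enc = tiktoken.get_encoding('gpt2')
np.array(enc.encode_ordinary(train_data), dtype=np.uint16).tofile('train.bin')
np.array(enc.encode_ordinary(val_data), dtype=np.uint16).tofile('val.bin')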

I left it running all night. In the morning it was still running so I hit Ctrl+C.

iter 142862: loss 2.1802, time 427.60ms
^CTraceback (most recent call last):
  File "/Users/simon/Dropbox/Development/nano-gpt-m2/nanoGPT/train.py", line 286, in <module>
    scaler.scale(loss).backward() if scaler else loss.backward()
  File "/Users/simon/.local/share/virtualenvs/nano-gpt-m2-McrI_bhW/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/Users/simon/.local/share/virtualenvs/nano-gpt-m2-McrI_bhW/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
KeyboardInterrupt

python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64       142911.57s user 40453.75s system 302% cpu 16:49:10.45 total
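
That wall-clock figure roughly checks out: 142,862 iterations at ~428ms each is about 61,000 seconds of work, in line with the 16h49m (~60,550s) total. The 302% cpu figure means roughly three cores were busy on average.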

The model seems to be here:

simon@Simons-MacBook-Pro nanoGPT % ls -la out
total 82048
drwxr-xr-x@  3 simon  staff        96 Jan 31 16:47 .
drwxr-xr-x@ 15 simon  staff       480 Jan 31 16:33 ..
-rw-r--r--@  1 simon  staff  41041114 Feb  1 09:16 ckpt.pt

The size had stayed the same but the timestamp kept changing overnight, so the training loop was evidently overwriting that one checkpoint file in place.
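
You can peek inside the checkpoint with torch.load to confirm training state is being saved along with the weights. A minimal sketch - the exact keys ('iter_num', 'best_val_loss' etc) are what I'd expect train.py to save, so treat them as assumptions:

import torch

# load the checkpoint onto the CPU and inspect what train.py saved
ckpt = torch.load('out/ckpt.pt', map_location='cpu')
print(ckpt.keys())                              # model weights plus training state
print(ckpt['iter_num'], ckpt['best_val_loss'])  # how far training got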

simonw commented 1 year ago

To sample from the trained model I had to make this change to sample.py:

diff --git a/sample.py b/sample.py
index 670759b..10034c0 100644
--- a/sample.py
+++ b/sample.py
@@ -17,7 +17,7 @@ max_new_tokens = 500 # number of tokens generated in each sample
 temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
 top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
 seed = 1337
-device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+device = 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
 dtype = 'bfloat16' # 'float32' or 'bfloat16' or 'float16'
 compile = False # use PyTorch 2.0 to compile the model to be faster
 exec(open('configurator.py').read()) # overrides from command line or config file
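
(Since sample.py execs configurator.py to pick up command-line overrides, the same change could presumably be made without editing the file at all, e.g. python sample.py --device=cpu - though I haven't tried that route here.)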

And now:

 % python sample.py

number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...

YORK:
'Twas byly curse that this king, the worse.

RICHARD:
I am his subject for my crown:
And valiant York, King Richard, Henry at black-t'd me.

MONTAGUE:
And I not warn him upon my throat.

GLOUCESTER:
It is theest flower of R IV:
But what did I can, and you'll?

LADY GREY:
Why, then I fight, the d words I can give.

KING EDWARD IV:
I am your cousin, he's for person than:
I am the traitor's name, and not mine,
But send him what he, and, and I perceive
Lieutenant: moody stone nor
As you have heard theWe say it were to hell;
In this for our h chase: which he is the duke's,
Upon his power.

LARTIUS:
O Marcius!
Let such e'en to do her.
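
That "No meta.pkl found" line means sample.py fell back to GPT-2's byte-pair encoding via tiktoken, rather than a vocabulary specific to the Shakespeare dataset. Paraphrasing what I understand sample.py to be doing (details may differ):

import tiktoken
import torch

# no meta.pkl, so fall back to the GPT-2 BPE tokenizer
enc = tiktoken.get_encoding('gpt2')
start_ids = enc.encode('\n')        # the default prompt
x = torch.tensor(start_ids, dtype=torch.long)[None, ...]  # a batch of one sequence
# x is then fed to model.generate(...)
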
simonw commented 1 year ago
diff --git a/sample.py b/sample.py
index 670759b..312e175 100644
--- a/sample.py
+++ b/sample.py
@@ -11,13 +11,19 @@ from model import GPTConfig, GPT
 # -----------------------------------------------------------------------------
 init_from = 'resume' # either 'resume' (from an out_dir) or a gpt2 variant (e.g. 'gpt2-xl')
 out_dir = 'out' # ignored if init_from is not 'resume'
-start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
+# start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
+start = """
+GLOUCESTER:
+What do you think of this, my lord?
+
+KING RICHARD II:
+"""
 num_samples = 10 # number of samples to draw
 max_new_tokens = 500 # number of tokens generated in each sample
 temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
 top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
 seed = 1337
-device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+device = 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
 dtype = 'bfloat16' # 'float32' or 'bfloat16' or 'float16'
 compile = False # use PyTorch 2.0 to compile the model to be faster
 exec(open('configurator.py').read()) # overrides from command line or config file

With my new custom starting point:

number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...

GLOUCESTER:
What do you think of this, my lord?

KING RICHARD II:
Yea: lords, towards London they will to London
Beell of Margaret: I'll frown it within,
And know not against the world.

DUKE OF YORK:

 Rivers noblecy to him.

DUKE OF YORK:
I do beseech you, my lord,
We am bound with: what is thy state why, in this?
I am the rests of, and you must be here
To be determined all that wench of the true
Of this mostutation a greatr:
But I cannot prove a dangerously drow;
A man in thine.
simonw commented 1 year ago

Trying again, this time on the M2's GPU via PyTorch's mps backend:

time python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device=mps --compile=False --eval_iters=1 --block_size=64 --batch_size=8

That's the same command as before, but with --device=mps.

It seems to go about 3x faster - ~130ms per iteration here, versus ~428ms per iteration on the CPU:

Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
vocab_size not found in data/shakespeare/meta.pkl, using GPT-2 default of 50257
Initializing a new model from scratch
number of parameters: 3.42M
step 0: train loss 10.8340, val loss 10.8173
iter 0: loss 10.8320, time 2572.14ms
iter 1: loss 10.8206, time 139.65ms
iter 2: loss 10.8160, time 129.98ms
iter 3: loss 10.8250, time 132.61ms
iter 4: loss 10.8312, time 130.44ms
iter 5: loss 10.8306, time 128.87ms
iter 6: loss 10.8264, time 127.75ms
iter 7: loss 10.8409, time 130.75ms
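
Note the first iteration took 2572ms against ~130ms for the rest - presumably one-off warmup cost on the mps backend. If mps doesn't get picked up at all, PyTorch has a couple of checks for it:

import torch

print(torch.backends.mps.is_available())  # True if an MPS device can be used now
print(torch.backends.mps.is_built())      # True if this build of PyTorch includes MPS support
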
simonw commented 1 year ago

I got the loss down to 2.1802 on CPU before I got bored.

I'll run it until it's in the 2.x range this time.

simonw commented 1 year ago
iter 591: loss 6.8914, time 138.89ms
iter 592: loss 6.7596, time 138.74ms

Already down to 6.7

simonw commented 1 year ago

Made this Observable notebook for plotting the loss:

https://observablehq.com/@simonw/plot-loss-from-nanogpt
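
The notebook just needs the iter/loss pairs from the training output. A rough Python equivalent, assuming the output had been saved to a hypothetical train.log file:

import re

# pull (iteration, loss) pairs out of lines like "iter 592: loss 6.7596, time 138.74ms"
pattern = re.compile(r'iter (\d+): loss ([\d.]+)')
with open('train.log') as f:
    points = [(int(i), float(loss)) for i, loss in pattern.findall(f.read())]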

Stopped it here:

iter 6184: loss 3.2523, time 142.04ms

python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 711.93s user 215.32s system 107% cpu 14:21.73 total
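
That's 6,184 iterations in ~862 seconds, or ~139ms per iteration - consistent with the per-iteration times in the log. CPU usage also dropped from 302% on the CPU run to 107% here, presumably because the heavy lifting moved to the GPU.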

Here's the plot:

[image: plot of training loss by iteration, falling from ~10.8 to ~3.25]
simonw commented 1 year ago
 python sample.py

number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...

GLOUCESTER:
What do you think of this, my lord?

KING RICHARD II:
Yea, and for what to give you to jest?

DUCHESS OF YORK:
I amiss, I'll deny it.

DUKE OF YORK:
I am the matter, I beseech your grace
Withalness to be your good queen;
For I shall send.

DUKE OF YORK:
A noble lord 'gainst your father's brother's life,
And that thy father is that our love I.

DUCHESS OF YORK:
This is dear earnest:
I have a noble princely father, my lord,
To help that I have done to thee mercy.

DUCHESS OF YORK:
But what'st thou? I mean to see thee when I swear,
Or thou will be holling his youth by thy cell,
And from thy light enough toad-night.

So sampling on the CPU works even though I trained on mps - that's using the same diff to sample.py as earlier.
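
That works because sample.py loads the checkpoint with an explicit map_location, so tensors saved from the mps device get remapped onto whatever device is doing the sampling - paraphrasing the relevant line:

import torch

# remap tensors that were saved on mps onto the CPU at load time
checkpoint = torch.load('out/ckpt.pt', map_location='cpu')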

simonw commented 1 year ago

Changing device to mps in that sample.py file doesn't work:

(nano-gpt-m2) simon@Simons-MacBook-Pro nanoGPT % python sample.py  

number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/nano-gpt-m2/nanoGPT/sample.py", line 93, in <module>
    y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
  File "/Users/simon/.local/share/virtualenvs/nano-gpt-m2-McrI_bhW/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/nano-gpt-m2/nanoGPT/model.py", line 326, in generate
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
RuntimeError: Currently topk on mps works only for k<=16 
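
The failing call is the top-k filter in nanoGPT's model.generate: sample.py asks for the top 200 logits, and at this point the mps backend only implements topk for k <= 16. Paraphrased, the sampling step does something like this:

import torch
from torch.nn import functional as F

logits = torch.randn(1, 50257)  # stand-in for the model's output logits
top_k = 200                     # the default in sample.py

v, _ = torch.topk(logits, min(top_k, logits.size(-1)))  # the call that fails on mps for k > 16
logits[logits < v[:, [-1]]] = -float('Inf')             # mask everything below the k-th logit
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)      # sample the next token

So presumably setting top_k to 16 or lower - e.g. python sample.py --top_k=16 - would sidestep this on mps, though I haven't tested that.
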
simonw commented 1 year ago

Turned this into a TIL: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2