Closed simonw closed 1 year ago
In a fresh Python 3.10 virtual environment (from brew install python):
pip install transformers datasets tiktoken tqdm wandb numpy
cd data/shakespeare
python prepare.py
cd ../..
time python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device=cpu --compile=False --eval_iters=1 --block_size=64 --batch_size=8
I left it running all night. In the morning it was still running so I hit Ctrl+C.
iter 142862: loss 2.1802, time 427.60ms
^CTraceback (most recent call last):
File "/Users/simon/Dropbox/Development/nano-gpt-m2/nanoGPT/train.py", line 286, in <module>
scaler.scale(loss).backward() if scaler else loss.backward()
File "/Users/simon/.local/share/virtualenvs/nano-gpt-m2-McrI_bhW/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/Users/simon/.local/share/virtualenvs/nano-gpt-m2-McrI_bhW/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
KeyboardInterrupt
python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 142911.57s user 40453.75s system 302% cpu 16:49:10.45 total
The model seems to be here:
simon@Simons-MacBook-Pro nanoGPT % ls -la out
total 82048
drwxr-xr-x@ 3 simon staff 96 Jan 31 16:47 .
drwxr-xr-x@ 15 simon staff 480 Jan 31 16:33 ..
-rw-r--r--@ 1 simon staff 41041114 Feb 1 09:16 ckpt.pt
The size had stayed the same but the timestamp kept changing overnight.
To run it I had to make this change:
diff --git a/sample.py b/sample.py
index 670759b..10034c0 100644
--- a/sample.py
+++ b/sample.py
@@ -17,7 +17,7 @@ max_new_tokens = 500 # number of tokens generated in each sample
temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1337
-device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+device = 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' # 'float32' or 'bfloat16' or 'float16'
compile = False # use PyTorch 2.0 to compile the model to be faster
exec(open('configurator.py').read()) # overrides from command line or config file
And now:
% python sample.py
number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...
YORK:
'Twas byly curse that this king, the worse.
RICHARD:
I am his subject for my crown:
And valiant York, King Richard, Henry at black-t'd me.
MONTAGUE:
And I not warn him upon my throat.
GLOUCESTER:
It is theest flower of R IV:
But what did I can, and you'll?
LADY GREY:
Why, then I fight, the d words I can give.
KING EDWARD IV:
I am your cousin, he's for person than:
I am the traitor's name, and not mine,
But send him what he, and, and I perceive
Lieutenant: moody stone nor
As you have heard theWe say it were to hell;
In this for our h chase: which he is the duke's,
Upon his power.
LARTIUS:
O Marcius!
Let such e'en to do her.
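The diff above edits the default in sample.py directly. Its last context line says configurator.py applies "overrides from command line or config file", and train.py accepted --device=cpu that way earlier, so passing --device=cpu to sample.py may work without touching the file (I haven't confirmed that here). A minimal sketch of how that kind of --key=value override can work, assuming nothing about the real configurator.py (apply_overrides is a hypothetical helper, not nanoGPT code):

```python
import ast


def apply_overrides(config, argv):
    """Return a copy of config with --key=value arguments applied.

    Hypothetical helper for illustration; not the code in configurator.py.
    """
    updated = dict(config)
    for arg in argv:
        # Ignore anything that is not of the form --key=value
        if not arg.startswith("--") or "=" not in arg:
            continue
        key, raw = arg[2:].split("=", 1)
        if key not in updated:
            raise ValueError(f"unknown config key: {key}")
        try:
            # Turns "16" into 16 and "False" into False
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Bare words such as cpu stay as strings
            value = raw
        updated[key] = value
    return updated


# Defaults copied from the sample.py lines shown in the diff
defaults = {"device": "cuda", "top_k": 200, "compile": False}
settings = apply_overrides(defaults, ["--device=cpu", "--top_k=16"])
print(settings)
```

Rejecting unknown keys means a typo like --devcie=cpu fails loudly instead of silently leaving the default in place.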
diff --git a/sample.py b/sample.py
index 670759b..312e175 100644
--- a/sample.py
+++ b/sample.py
@@ -11,13 +11,19 @@ from model import GPTConfig, GPT
# -----------------------------------------------------------------------------
init_from = 'resume' # either 'resume' (from an out_dir) or a gpt2 variant (e.g. 'gpt2-xl')
out_dir = 'out' # ignored if init_from is not 'resume'
-start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
+# start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
+start = """
+GLOUCESTER:
+What do you think of this, my lord?
+
+KING RICHARD II:
+"""
num_samples = 10 # number of samples to draw
max_new_tokens = 500 # number of tokens generated in each sample
temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1337
-device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
+device = 'cpu' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' # 'float32' or 'bfloat16' or 'float16'
compile = False # use PyTorch 2.0 to compile the model to be faster
exec(open('configurator.py').read()) # overrides from command line or config file
With my new custom starting point:
number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...
GLOUCESTER:
What do you think of this, my lord?
KING RICHARD II:
Yea: lords, towards London they will to London
Beell of Margaret: I'll frown it within,
And know not against the world.
DUKE OF YORK:
Rivers noblecy to him.
DUKE OF YORK:
I do beseech you, my lord,
We am bound with: what is thy state why, in this?
I am the rests of, and you must be here
To be determined all that wench of the true
Of this mostutation a greatr:
But I cannot prove a dangerously drow;
A man in thine.
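The comment on the start line in that diff mentions an alternative to pasting the prompt into sample.py: a value like "FILE:prompt.txt". A rough sketch of how such a prefix could be resolved, assuming it simply means "read the prompt from this file" (resolve_start is a hypothetical name, not taken from sample.py):

```python
import os
import tempfile


def resolve_start(start):
    """Return the prompt text, reading it from disk if start begins with FILE:.

    Hypothetical helper; the real handling in sample.py may differ.
    """
    if start.startswith("FILE:"):
        with open(start[len("FILE:"):], "r", encoding="utf-8") as f:
            return f.read()
    return start


# Same prompt as the triple-quoted string in the diff
prompt = "\nGLOUCESTER:\nWhat do you think of this, my lord?\n\nKING RICHARD II:\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prompt.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(prompt)
    loaded = resolve_start("FILE:" + path)

print(repr(loaded))
```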
Trying again with M2:
time python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device=mps --compile=False --eval_iters=1 --block_size=64 --batch_size=8
That's with --device=mps
Seems to go about 3x faster, at roughly 130ms per iteration versus the 427.60ms logged at the end of the CPU run:
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
vocab_size not found in data/shakespeare/meta.pkl, using GPT-2 default of 50257
Initializing a new model from scratch
number of parameters: 3.42M
step 0: train loss 10.8340, val loss 10.8173
iter 0: loss 10.8320, time 2572.14ms
iter 1: loss 10.8206, time 139.65ms
iter 2: loss 10.8160, time 129.98ms
iter 3: loss 10.8250, time 132.61ms
iter 4: loss 10.8312, time 130.44ms
iter 5: loss 10.8306, time 128.87ms
iter 6: loss 10.8264, time 127.75ms
iter 7: loss 10.8409, time 130.75ms
I got the loss down to 2.1802 on CPU before I got bored.
I'll run it until it's in the 2.x range this time.
iter 591: loss 6.8914, time 138.89ms
iter 592: loss 6.7596, time 138.74ms
Already down to 6.7
Made this:
https://observablehq.com/@simonw/plot-loss-from-nanogpt
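The iter lines in the training output have a regular shape, so turning them into points for a plot is mostly one regular expression. This is a sketch in Python rather than the code in that notebook; the pattern is inferred only from the log lines shown above, and parse_log is a name made up for the example:

```python
import re

# Matches lines like "iter 592: loss 6.7596, time 138.74ms"
ITER_LINE = re.compile(r"^iter (\d+): loss ([0-9.]+), time ([0-9.]+)ms$")


def parse_log(text):
    """Return (iteration, loss, milliseconds) tuples for every iter line."""
    points = []
    for line in text.splitlines():
        match = ITER_LINE.match(line.strip())
        if match:
            points.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return points


# A few lines copied from the output above; the "step" line is skipped
sample = """step 0: train loss 10.8340, val loss 10.8173
iter 0: loss 10.8320, time 2572.14ms
iter 591: loss 6.8914, time 138.89ms
iter 592: loss 6.7596, time 138.74ms
"""

points = parse_log(sample)
for iteration, loss, ms in points:
    print(iteration, loss, ms)
```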
Stopped it here:
iter 6184: loss 3.2523, time 142.04ms
python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 711.93s user 215.32s system 107% cpu 14:21.73 total
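The two time totals give a second way to check the "3x faster" impression. This divides wall-clock time by the last iteration number logged in each run. It is only a rough estimate, since wall time also includes startup and evaluation steps:

```python
def to_seconds(clock):
    """Convert a h:mm:ss.ss or mm:ss.ss string from `time` into seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Totals and final iteration numbers copied from the two runs above
cpu_s_per_iter = to_seconds("16:49:10.45") / 142862
mps_s_per_iter = to_seconds("14:21.73") / 6184
speedup = cpu_s_per_iter / mps_s_per_iter

print(f"cpu: {cpu_s_per_iter * 1000:.1f} ms/iter")
print(f"mps: {mps_s_per_iter * 1000:.1f} ms/iter")
print(f"speedup: {speedup:.2f}x")
```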
Here's the plot:
python sample.py
number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...
GLOUCESTER:
What do you think of this, my lord?
KING RICHARD II:
Yea, and for what to give you to jest?
DUCHESS OF YORK:
I amiss, I'll deny it.
DUKE OF YORK:
I am the matter, I beseech your grace
Withalness to be your good queen;
For I shall send.
DUKE OF YORK:
A noble lord 'gainst your father's brother's life,
And that thy father is that our love I.
DUCHESS OF YORK:
This is dear earnest:
I have a noble princely father, my lord,
To help that I have done to thee mercy.
DUCHESS OF YORK:
But what'st thou? I mean to see thee when I swear,
Or thou will be holling his youth by thy cell,
And from thy light enough toad-night.
So CPU works even though I trained on mps; that's using the same diff to sample.py as earlier.
Changing device to mps in that sample.py file doesn't work:
(nano-gpt-m2) simon@Simons-MacBook-Pro nanoGPT % python sample.py
number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...
Traceback (most recent call last):
File "/Users/simon/Dropbox/Development/nano-gpt-m2/nanoGPT/sample.py", line 93, in <module>
y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
File "/Users/simon/.local/share/virtualenvs/nano-gpt-m2-McrI_bhW/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Users/simon/Dropbox/Development/nano-gpt-m2/nanoGPT/model.py", line 326, in generate
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
RuntimeError: Currently topk on mps works only for k<=16
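The error message names the limit directly (k<=16), while sample.py defaults to top_k = 200. One possible workaround, untested here, is to cap k when the device is mps. A minimal sketch of that cap, where effective_top_k is a hypothetical helper and the limit of 16 is taken only from the message above (it may differ in other PyTorch versions):

```python
# Limit quoted in the RuntimeError above; an assumption for other versions
MPS_TOPK_LIMIT = 16


def effective_top_k(top_k, vocab_size, device):
    """Pick a k for top-k sampling that stays within the mps limit.

    Mirrors the min(top_k, logits.size(-1)) expression from the traceback,
    then caps the result at MPS_TOPK_LIMIT when device is mps.
    """
    k = min(top_k, vocab_size)
    if device.startswith("mps"):
        k = min(k, MPS_TOPK_LIMIT)
    return k


print(effective_top_k(200, 50257, "mps"))
print(effective_top_k(200, 50257, "cpu"))
```

Capping k changes which tokens can be sampled, so output on mps with k=16 would not be directly comparable to the CPU samples above that used top_k = 200.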
Turned this into a TIL: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2
https://github.com/karpathy/nanoGPT/blob/master/README.md#i-only-have-a-macbook