karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

The input Shakespeare file does not contain the entire Shakespeare #216

Open dkobak opened 1 year ago

dkobak commented 1 year ago

The input file with the Shakespeare text, https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt, has exactly 40000 lines and does not contain all of Shakespeare; for example, it does not contain the string "Hamlet".
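A check like this can be reproduced with a few lines of Python. The following is only a minimal sketch: the helper names `summarize` and `fetch` are made up for illustration, and the line count it reports depends on how line breaks are counted (here via `str.splitlines`).

```python
from urllib.request import urlopen

# URL quoted in the comment above.
URL = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
       "master/data/tinyshakespeare/input.txt")


def summarize(text, needle="Hamlet"):
    """Return (number of lines, whether `needle` occurs anywhere in `text`)."""
    return len(text.splitlines()), needle in text


def fetch(url=URL):
    """Download the file and decode it as UTF-8 (hypothetical helper)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


# Usage (needs network access):
#   n_lines, has_hamlet = summarize(fetch())
#   print(n_lines, has_hamlet)
```

The search is case-sensitive, so passing `needle="HAMLET"` would test for an upper-case speaker label instead.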

What exactly is in that file and why is it incomplete? Is it on purpose?

Thanks for the great work, btw, your NanoGPT YouTube video is amazing.

Coriana commented 1 year ago

Well, it's the TinyShakespeare dataset: https://www.tensorflow.org/datasets/catalog/tiny_shakespeare

which is labeled as 40000 lines of Shakespeare, so yes, on purpose.

karpathy commented 1 year ago

Yeah apparently it isn't all of Shakespeare. Silly but I wasn't aware of it, or more likely I forgot that by now :D. Would love the full works of Shakespeare though...

dkobak commented 1 year ago

@karpathy Project Gutenberg seems to have the entire Shakespeare (plays + sonnets + poems) in one TXT file available here:

https://www.gutenberg.org/cache/epub/100/pg100.txt

It has 182k lines.

Removing the publishing notes at the beginning and end leaves lines 83--181654.

The plays alone are lines 2860--177314.
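Cutting the file down to one of these ranges could look roughly like the sketch below. It assumes the line numbers are 1-indexed and inclusive, as a text editor would show them, and that the Project Gutenberg file has not changed since the numbers were written down. `slice_lines` and the two constants are illustrative names, not part of nanoGPT.

```python
def slice_lines(text, first, last):
    """Return lines first..last (1-indexed, inclusive), joined with "\n"."""
    # Normalize Windows line endings so each "\n" marks exactly one line.
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(lines[first - 1:last])


# Ranges quoted in the comment above (assumed 1-indexed and inclusive).
ALL_WORKS = (83, 181654)     # everything except the publishing notes
PLAYS_ONLY = (2860, 177314)  # plays without sonnets and poems

# Usage (needs network access):
#   from urllib.request import urlopen
#   url = "https://www.gutenberg.org/cache/epub/100/pg100.txt"
#   raw = urlopen(url).read().decode("utf-8-sig")  # utf-8-sig drops a BOM if present
#   plays = slice_lines(raw, *PLAYS_ONLY)
```

It is worth printing the first and last few lines of the result before training on it, to confirm that the boundaries still fall where expected.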