The true names/sizes of the 4 GPT-2 models

fran-babylon commented 5 years ago

Can you please clarify the true names of the 4 now available models?

SMALL:

117M according to https://github.com/openai/gpt-2-output-dataset and https://github.com/openai/gpt-2/blob/master/README.md and https://openai.com/blog/better-language-models/
124M according to https://github.com/openai/gpt-2/blob/master/download_model.py and https://openai.com/blog/gpt-2-1-5b-release/

MEDIUM:

345M according to https://github.com/openai/gpt-2/blob/master/README.md and https://github.com/openai/gpt-2-output-dataset and https://openai.com/blog/better-language-models/
355M according to https://openai.com/blog/gpt-2-1-5b-release/

LARGE:

762M according to https://github.com/openai/gpt-2-output-dataset and https://openai.com/blog/better-language-models/
774M according to https://openai.com/blog/gpt-2-1-5b-release/

EXTRA LARGE

1542M according to https://github.com/openai/gpt-2-output-dataset
1.5BM according to https://openai.com/blog/gpt-2-1-5b-release/

This makes downloading them through download_model.py incredibly hard. It'd be really useful if you could put the true names either on the readme or in the download script.

Thanks

shelvacu commented 5 years ago

For anyone here because they're stuck

It looks like the available options for use with download_model.py are 1558M, 774M, 355M, 345M, 124M, and 117M

google storage happily gives a file list: https://storage.googleapis.com/gpt-2/ which is the only reason I know this

EDIT: It was staring me right in the face https://github.com/openai/gpt-2/blob/master/DEVELOPERS.md

zzj0402 commented 5 years ago

According to the paper table 2, they are the architecture hyperparameters amounts.

"The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT."

https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

casaro commented 4 years ago

Note that 117M and 124M are exactly the same checkpoint. They have around 124M parameters but it was called 117M initially. I guess they renamed it when they found this error.

For the later ones I haven't checked but Google Storage will give you the exact same size for 345M and 355M, so I assume they are the same as well.

mohataher commented 4 years ago

What vocabulary size was used for each one of those models though?

prestonfrasch commented 4 years ago

You can download all data sets from the main gpt-2 repo and that should be countable. * Just 250k of the webtext. The data sets are gpt-2 samples, sorry! https://github.com/openai/gpt-2-output-dataset

From the paper: "The resulting dataset, WebText, contains the text subsetof these 45 million links. To extract the text from HTMLresponses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper1content extractors. All results presented in this paper use a preliminary version of Web Text which does not include links created after Dec2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text." https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

casaro commented 4 years ago

You can check the hparams.json file in Google storage. For all models it's the same:

"n_vocab": 50257,

maigva commented 3 years ago

to EXTRA LARGE I only could download with this parameter: 1558M

_$ python downloadmodel.py 1558M EXTRA LARGE 1542M according to https://github.com/openai/gpt-2-output-dataset 1.5BM according to https://openai.com/blog/gpt-2-1-5b-release/

openai / gpt-2

The true names/sizes of the 4 GPT-2 models #209