fran-babylon opened this issue 5 years ago
For anyone here because they're stuck: it looks like the available options for use with `download_model.py` are `1558M`, `774M`, `355M`, `345M`, `124M`, and `117M`.
Google Storage happily gives a file list at https://storage.googleapis.com/gpt-2/, which is the only reason I know this.

EDIT: It was staring me right in the face: https://github.com/openai/gpt-2/blob/master/DEVELOPERS.md
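If anyone wants to check for themselves, the bucket listing is plain XML, so a quick sketch like this works (it assumes the checkpoints sit under `models/<size>/` in the bucket, which is the layout `download_model.py` uses, and it only reads the first page of the listing):

```python
import urllib.request
import xml.etree.ElementTree as ET

# The bucket listing at https://storage.googleapis.com/gpt-2/ is an XML
# ListBucketResult; fetch it and pull out the object keys.
listing = urllib.request.urlopen("https://storage.googleapis.com/gpt-2/").read()
root = ET.fromstring(listing)
keys = [el.text for el in root.iter() if el.tag.endswith("Key")]

# Model files are stored as models/<size>/<filename>, so the available sizes
# are the second path component of those keys.
sizes = sorted({key.split("/")[1] for key in keys
                if key.startswith("models/") and key.count("/") >= 2})
print(sizes)
```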
According to Table 2 of the paper, those numbers are the parameter counts of the four architecture configurations.
"The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT."
Note that 117M and 124M are exactly the same checkpoint. They have around 124M parameters but it was called 117M initially. I guess they renamed it when they found this error.
I haven't checked the others, but Google Storage reports exactly the same file sizes for 345M and 355M, so I assume those are the same checkpoint as well.
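You can verify that without downloading anything by comparing the `Content-Length` that Google Storage reports for the main checkpoint file under each name. A rough sketch (the `models/<size>/model.ckpt.data-00000-of-00001` path is assumed from `download_model.py`):

```python
import urllib.request

def checkpoint_size(model):
    """Return the size in bytes Google Storage reports for a model's main checkpoint file."""
    url = (f"https://storage.googleapis.com/gpt-2/models/{model}"
           "/model.ckpt.data-00000-of-00001")
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return int(response.headers["Content-Length"])

for model in ("345M", "355M"):
    print(model, checkpoint_size(model))
# If both lines print the same byte count, they are almost certainly the same checkpoint.
```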
What vocabulary size was used for each one of those models though?
You can download all the data sets from the main gpt-2 output dataset repo, and the vocabulary should be countable from those. *Correction: it only includes 250k WebText samples, and the other data sets are GPT-2 samples, sorry! https://github.com/openai/gpt-2-output-dataset
From the paper: "The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text." https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
You can check the `hparams.json` file in Google Storage. For all models it's the same: `"n_vocab": 50257`.
The EXTRA LARGE model I could only download with this parameter: `1558M`

`$ python download_model.py 1558M`

EXTRA LARGE is 1542M according to https://github.com/openai/gpt-2-output-dataset and 1.5B according to https://openai.com/blog/gpt-2-1-5b-release/
Can you please clarify the true names of the 4 models that are now available?

- **SMALL**: 117M according to https://github.com/openai/gpt-2-output-dataset, https://github.com/openai/gpt-2/blob/master/README.md, and https://openai.com/blog/better-language-models/; 124M according to https://github.com/openai/gpt-2/blob/master/download_model.py and https://openai.com/blog/gpt-2-1-5b-release/
- **MEDIUM**: 345M according to https://github.com/openai/gpt-2/blob/master/README.md, https://github.com/openai/gpt-2-output-dataset, and https://openai.com/blog/better-language-models/; 355M according to https://openai.com/blog/gpt-2-1-5b-release/
- **LARGE**: 762M according to https://github.com/openai/gpt-2-output-dataset and https://openai.com/blog/better-language-models/; 774M according to https://openai.com/blog/gpt-2-1-5b-release/
- **EXTRA LARGE**: 1542M according to https://github.com/openai/gpt-2-output-dataset; 1.5B according to https://openai.com/blog/gpt-2-1-5b-release/

This makes downloading them through `download_model.py` incredibly hard. It'd be really useful if you could put the true names either in the README or in the download script. Thanks!
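For what it's worth, here is the mapping as I read the sources above; the right-hand names are the ones `download_model.py` actually accepts. Treat this as my interpretation, not an official statement:

```python
# Published name -> argument download_model.py accepts (my best reading of the sources above).
MODEL_NAMES = {
    "SMALL":       "124M",   # published as 117M before the parameter count was corrected
    "MEDIUM":      "355M",   # published as 345M in the README and the output-dataset repo
    "LARGE":       "774M",   # listed as 762M in the output-dataset repo
    "EXTRA LARGE": "1558M",  # a.k.a. 1542M in the output-dataset repo, "1.5B" in the blog post
}

print(MODEL_NAMES["EXTRA LARGE"])  # -> 1558M, i.e. `python download_model.py 1558M`
```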