openai / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
https://openai.com/blog/better-language-models/

Question about reported perplexities #78

Open myleott opened 5 years ago

myleott commented 5 years ago

I've been reading your paper, interesting work.

I have a question about how you compute perplexities, especially over datasets that are already tokenized (e.g., wikitext-103). I understand that your encoding can assign probabilities to any string, but I'd expect the LM to do poorly when fed pre-tokenized input. For example, the tokenized wikitext-103 input looks like `M @-@ 82 begins at a junction with M @-@ 120 and B @-@ 96 west of Fremont .` How do you report perplexity in this case?

WuTheFWasThat commented 5 years ago

We discuss this in Section 3.1 (Language Modeling). Essentially, we run invertible "de-tokenizers" to massage the text into a more natural format, and scale the losses according to the token ratio.
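
The de-tokenizers themselves aren't released in this repository. Purely as an illustration, here is a minimal sketch of the kind of rules such a de-tokenizer might apply to WikiText-style text; the actual rules used for the paper are not public, and making them truly invertible takes more care than shown here:

```python
import re

def wikitext_detokenize(text):
    """Illustrative sketch only, not the authors' de-tokenizer.

    Reverses a few common WikiText tokenization artifacts so the input
    looks more like the naturally formatted text GPT-2 was trained on.
    """
    rules = [
        (r" @-@ ", "-"),           # "M @-@ 82"   -> "M-82"
        (r" @,@ ", ","),           # "1 @,@ 000"  -> "1,000"
        (r" @\.@ ", "."),          # "3 @.@ 5"    -> "3.5"
        (r" ([.,;:!?)])", r"\1"),  # drop space before closing punctuation
        (r"\( ", "("),             # drop space after opening parenthesis
        (r" n't", "n't"),          # "do n't"     -> "don't"
        (r" 's", "'s"),            # "model 's"   -> "model's"
    ]
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

print(wikitext_detokenize("M @-@ 82 begins at a junction with M @-@ 120 ."))
# -> M-82 begins at a junction with M-120.
```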

That said, although it's not mentioned in the paper, we also ran zero-shot on the raw versions of all the language modeling benchmarks, and the models still do reasonably well in that case (SOTA on all the smaller datasets, and 1 ppl away on wikitext103). The model is pretty used to a variety of formats due to the diversity of the training set!

myleott commented 5 years ago

Thanks for the clarification! Just to confirm, you scale the losses before the exponentiation, i.e., ppl = 2**(sum(losses) / num_original_tokens)?

And do you also include the loss for the end-of-sentence symbol?

WuTheFWasThat commented 5 years ago

Yes, we scale before exponentiation. Regarding the end-of-sentence symbol, I'll let Alec get back to you on that! @Newmu

john-hewitt commented 5 years ago

Following up on this, the equation given by @myleott makes it seem like the base of the exponent used in perplexity calculation is 2, when it seems like it should be base e given that the base of the log in log-probabilities is (generally) e. Also, the scaling seems like it should be by num_tokenized_tokens, not num_original_tokens, since num_tokenized_tokens predictions are made.

Thus, is ppl = e**(sum(losses) / num_tokenized_tokens) correct, as opposed to the equation given earlier?

myleott commented 5 years ago

You're right about the base, that was a typo on my part. But I think it should be num_original_tokens, otherwise your perplexity would be affected by how you tokenize the data, right? For a fair comparison, everyone should ideally report perplexity over the same number of token outputs (regardless of how the text is tokenized internally).
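
For concreteness, the normalization being discussed might look like the sketch below, assuming `losses` holds per-BPE-token negative log-likelihoods in nats; this is an illustration of the idea, not the official evaluation code:

```python
import math

def perplexity(losses, num_original_tokens):
    """Illustrative sketch of the normalization discussed above.

    losses: per-BPE-token negative log-likelihoods in nats (base e).
    num_original_tokens: token count of the dataset's original tokenization
    (e.g., WikiText word tokens), so results stay comparable across models
    that use different internal tokenizers.
    """
    return math.exp(sum(losses) / num_original_tokens)
```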

Newmu commented 5 years ago

It depends on the dataset: we use whatever metric results have previously been reported with. So for WikiText or PTB we report perplexities in base e / nats, while for enwik8 it's base 2 / bits. Regarding the EOS symbol, it is functionally equivalent to a newline for WikiText, so we just use the newlines in place of an actual EOS token. As a sanity check, I made sure we had automated code for invertibility, so there was an exact mapping between the original file on disk (or its default processing) and the version we worked with.

As @myleott mentions, you have to account for and adjust to the originally reported token averages, otherwise the numbers aren't directly comparable. The way I think about this is that you should always compute the log-prob of the whole dataset and then adjust it to match the units of other reported metrics.
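
A sketch of that unit adjustment, assuming `total_nll_nats` is the summed negative log-likelihood of the whole test set in nats and `num_units` is the count prior work normalizes by (original word tokens for WikiText/PTB, characters for enwik8); the interface here is illustrative, not from the official code:

```python
import math

def to_reported_metric(total_nll_nats, dataset, num_units):
    """Convert a dataset-level loss into the unit prior work reports on."""
    if dataset in ("wikitext2", "wikitext103", "ptb"):
        # word-level perplexity: base e, normalized by the original word count
        return math.exp(total_nll_nats / num_units)
    if dataset == "enwik8":
        # bits per character: base 2, normalized by the character count
        return total_nll_nats / (math.log(2) * num_units)
    raise ValueError(f"unknown dataset: {dataset}")
```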

8enmann commented 5 years ago

Another source of confusion: wikitext-2 and wikitext-103 have the same validation and test sets, yet the GPT-2 models get different scores on the two datasets in Table 3. Which split did you report on in the paper? Training? But the quoted SotA for wikitext-2 in the table comes from a paper that evaluates on the test set, and the paper behind the SotA number for wikitext-103 isn't specified.

Any chance you'd release your invertible tokenizers and evaluation code? Then we could properly benchmark.

likicode commented 5 years ago

Could you please clarify how you process the WikiText-2 dataset? My questions are:

  1. According to https://github.com/salesforce/awd-lstm-lm/blob/32fcb42562aeb5c7e6c9dec3f2a3baaaf68a5cb5/data.py#L51 , they add `<eos>` at the end of each line in the raw text file. Did you do the same, or just read the raw text?

  2. There are lots of `<unk>` tokens in the raw text file. I observed that the GPT-2 tokenizer tokenizes `<unk>` into ["<", "unk", ">"]. Did you replace `<unk>` with another special token?

Here are some results I got:

Raw text file (keeping `<unk>`):

- 29.0 (read the raw data file directly)
- 29.2 (split() each line before tokenizing)
- 31.0 (split() each line before tokenizing + add '<|endoftext|>' at the end of each line)

Replacing `<unk>` in the raw file with 'unk':

- 35.1 (read the raw data file directly)
- 35.7 (split() each line before tokenizing)
- 37.7 (split() each line before tokenizing + add '<|endoftext|>' at the end of each line)

Which score should I use to replicate your work?
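
For anyone trying to reproduce these numbers, one way to construct the three preprocessing variants described above is sketched below; `gpt2_perplexity` is a placeholder for whatever scoring routine you use, and the exact splitting rules are assumptions rather than the commenter's actual code:

```python
def load_variants(path):
    """Build the three WikiText-2 preprocessing variants described above."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    lines = raw.splitlines()

    return {
        # 1) feed the raw file as-is
        "raw": raw,
        # 2) whitespace-split each line, then re-join (normalizes spacing)
        "split": "\n".join(" ".join(line.split()) for line in lines),
        # 3) same as (2), plus an explicit end-of-text marker per line
        "split_eot": "\n".join(
            " ".join(line.split()) + " <|endoftext|>" for line in lines
        ),
    }

# for name, text in load_variants("wiki.test.raw").items():
#     print(name, gpt2_perplexity(text))  # gpt2_perplexity: your own eval code
```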

leejason commented 5 years ago

> Any chance you'd release your invertible tokenizers and evaluation code? Then we could properly benchmark.

+1

BTW, is it possible to know whether your evaluation code uses tf.contrib.seq2seq.sequence_loss(), or whether it's just based on tf.nn.sparse_softmax_cross_entropy_with_logits()?

leejason commented 5 years ago

> Here are some results I got:

Interesting. Would it be possible to share your code for reproducing those results?