We discuss this in Section 3.1 (Language Modeling). Essentially we run invertible "de-tokenizers" to massage the text into a more friendly format, and scale losses according to the token ratio.

That said, although it's not mentioned in the paper, we also ran zero-shot evaluation on the raw versions of all the language modeling benchmarks, and the models still do reasonably well in that case (SOTA on all the smaller datasets, and 1 ppl away on WikiText-103). It's pretty used to a variety of formats due to the diversity of the training set!
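For concreteness, here is a minimal sketch of what such an invertible de-tokenizer might look like for WikiText-style text. The substitution rules below are illustrative only, not the exact rules used for the paper:

```python
import re

# Illustrative WikiText-style de-tokenization rules; not the paper's actual rules.
# Invertibility has to be verified against the corpus (e.g. that the detokenized
# forms never occur verbatim in the original text), as discussed later in this thread.
RULES = [
    (" @-@ ", "-"),
    (" @,@ ", ","),
    (" @.@ ", "."),
]

def detokenize(text: str) -> str:
    """Map pre-tokenized WikiText text to a more natural surface form."""
    for src, dst in RULES:
        text = text.replace(src, dst)
    # WikiText also inserts a space before most punctuation; strip it.
    text = re.sub(r" ([.,;:!?%)])", r"\1", text)
    text = re.sub(r"\( ", "(", text)
    return text

print(detokenize("M @-@ 82 begins at a junction with M @-@ 120 and B @-@ 96 west of Fremont ."))
# M-82 begins at a junction with M-120 and B-96 west of Fremont.
```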
Thanks for the clarification! Just to confirm, you scale the losses before the exponentiation, i.e., `ppl = 2**(sum(losses) / num_original_tokens)`?
And do you also include the loss for the end-of-sentence symbol?
Yes, we scale before exponentiation. Regarding the end-of-sentence symbol, I'll let Alec get back to you on that! @Newmu
Following up on this, the equation given by @myleott makes it seem like the base of the exponent used in the perplexity calculation is 2, when it seems like it should be base e, given that the base of the log in log-probabilities is (generally) e. Also, the scaling seems like it should be by `num_tokenized_tokens`, not `num_original_tokens`, since `num_tokenized_tokens` predictions are made. Thus, is `ppl = e**(sum(losses) / num_tokenized_tokens)` correct, as opposed to the equation given earlier?
You're right about the base, typo on my part. But I think it should be `num_original_tokens`, otherwise your perplexity would be affected by how you tokenize the data, right? For a fair comparison everyone should ideally report perplexity over the same number of token outputs (regardless of how it's tokenized internally).
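In other words, something like the following sketch, assuming you already have per-BPE-token losses in nats (the names are placeholders):

```python
import math

def perplexity(bpe_token_nlls, num_original_tokens):
    """Perplexity normalized by the original token count.

    bpe_token_nlls: negative log-likelihoods in nats, one per BPE token the
        model actually predicted.
    num_original_tokens: token count of the dataset under its original
        tokenization (e.g. whitespace tokens for WikiText), so the number is
        comparable across models with different internal tokenizers.
    """
    return math.exp(sum(bpe_token_nlls) / num_original_tokens)

# Toy example: 6 BPE predictions covering a sentence of 4 original tokens.
print(perplexity([2.1, 0.7, 1.3, 3.0, 0.4, 1.9], num_original_tokens=4))
```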
It depends on the dataset, and we use whatever metric results have previously been reported with. So for WikiText or PTB we report perplexities in base e / nats, while for enwik8 it's base 2 / bits. Re the EOS symbol: it's functionally equivalent to a newline for WikiText, so we just use the newlines in place of an actual EOS token. As a sanity check, I made sure we had automated code for invertibility, so we had an exact mapping between the original file (on disk or with its default processing) and the version we worked with.

As @myleott mentions, you have to account for and adjust to the originally reported token averages, otherwise they aren't direct comparisons. The way I think about this is that you should always compute the log-prob of the whole dataset and then adjust it to match the units of other reported metrics.
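A sketch of that "compute the total log-prob once, then convert units" step (the counts passed in below are placeholders, not the actual dataset sizes):

```python
import math

def convert(total_nll_nats, num_original_tokens=None, num_bytes=None):
    """Convert a dataset-level NLL (in nats) into conventionally reported units.

    WikiText / PTB results are perplexities (base e, per original token);
    enwik8 results are bits per character/byte (base 2).
    """
    out = {}
    if num_original_tokens is not None:
        out["ppl"] = math.exp(total_nll_nats / num_original_tokens)
    if num_bytes is not None:
        out["bpc"] = total_nll_nats / math.log(2) / num_bytes
    return out

# Placeholder numbers, just to show the call:
print(convert(total_nll_nats=850_000.0, num_original_tokens=245_000))
print(convert(total_nll_nats=850_000.0, num_bytes=5_000_000))
```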
Another source of confusion: WikiText-2 and WikiText-103 have the same validation and test set, yet the GPT-2 models get different scores on the two datasets in Table 3. Which split did you report on in the paper? The training set? But the quoted SotA for WikiText-2 in the table comes from a paper that evaluates on the test set, and the paper providing the SotA for WikiText-103 isn't specified.
Any chance you'd release your invertible tokenizers and evaluation code? Then we could properly benchmark.
Could you please help clarify how you process the WikiText-2 dataset? My questions are:
1. According to https://github.com/salesforce/awd-lstm-lm/blob/32fcb42562aeb5c7e6c9dec3f2a3baaaf68a5cb5/data.py#L51, they add `<eos>` at the end of each line of the raw text file. Did you do the same, or just read the raw text?
2. There are lots of `<unk>` tokens in the raw text file. I observed that the GPT-2 tokenizer will tokenize `<unk>` into `["<", "unk", ">"]`. Did you replace `<unk>` with another special token?
Here are some results I got (the three preprocessing variants are sketched below):

With `<unk>` kept in the raw text file:
- 29.0 (read the raw data file directly)
- 29.2 (split() each line before tokenizing)
- 31.0 (split() each line before tokenizing + add `<|endoftext|>` at the end of each line)

With `<unk>` in the raw file replaced by `unk`:
- 35.1 (read the raw data file directly)
- 35.7 (split() each line before tokenizing)
- 37.7 (split() each line before tokenizing + add `<|endoftext|>`)
Which score should I use to replicate your work?
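To pin down what the three variants above refer to, here is a rough sketch; the file path and the use of the `transformers` tokenizer are just for illustration, not necessarily the authors' setup:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def load_wikitext2(path="wiki.test.tokens", variant="raw"):
    """Return GPT-2 BPE ids for WikiText-2 under one of three preprocessing variants.

    Replacing '<unk>' with 'unk' is an orthogonal choice that would be applied
    to `lines` beforehand.
    """
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    if variant == "raw":
        # Read the raw file directly, newlines and all.
        text = "".join(lines)
    elif variant == "split":
        # split() each line, then re-join with single spaces.
        text = "\n".join(" ".join(l.split()) for l in lines)
    elif variant == "split_eot":
        # As above, plus '<|endoftext|>' appended to each line.
        text = "\n".join(" ".join(l.split()) + "<|endoftext|>" for l in lines)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return tokenizer(text)["input_ids"]
```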
> Any chance you'd release your invertible tokenizers and evaluation code? Then we could properly benchmark.
+1
BTW, is it possible to know whether `tf.contrib.seq2seq.sequence_loss()` is used in your evaluation code? Or is it just based on `tf.nn.sparse_softmax_cross_entropy_with_logits()`?
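For reference, `sequence_loss` is (as far as I know) built on top of `sparse_softmax_cross_entropy_with_logits`, so with unit weights they give the same per-token NLLs. A minimal TF1-style sketch, with illustrative shapes and names rather than anything taken from the actual evaluation code:

```python
import tensorflow as tf  # TF1-style graph code, purely illustrative

# logits: [batch, time, vocab]; targets: [batch, time]
logits = tf.placeholder(tf.float32, [None, None, 50257])
targets = tf.placeholder(tf.int32, [None, None])

# One negative log-likelihood (in nats) per predicted position.
nll = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)

# Summing (rather than averaging) gives the total log-prob of the data, which
# can then be normalized by whichever token count the reported metric expects.
total_nll = tf.reduce_sum(nll)
```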
> Here are some results I got:
Interesting. Would it be possible to share your code for reproducing those results?
I've been reading your paper, interesting work.
I have a question about how you compute perplexities, especially over datasets that are already tokenized (e.g., WikiText-103). I understand that your encoding can assign probabilities to any string, but I'd expect the LM to do poorly when fed pre-tokenized input. For example, the tokenized WikiText-103 input looks like:

```
M @-@ 82 begins at a junction with M @-@ 120 and B @-@ 96 west of Fremont .
```
How do you report perplexity in this case?