kimiyoung / transformer-xl


Different ppl values for same inputs #25

Open sruteesh-pivot opened 5 years ago

sruteesh-pivot commented 5 years ago

Hi, I observed slightly different values when evaluating the perplexity of a set of sentences with batch_size = 1 versus looping through the sentences one by one (all other parameters being the same). The loss is 0.7707 in one case and 0.7564 in the other.

I created the data iterator using dataset="lm1b". Note: I modified corpus.vocab.encode_file to encode the input sentence instead of reading from a file. Any particular reason why this is observed?

zihangdai commented 5 years ago

Could you explain the difference a little bit more? If I understand correctly, for bsz=1, you first concatenate all sentences into a single sequence and evaluate on each fixed-length chunk of the concatenated sequence. When you "loop through the sentences one by one", you evaluate on each sentence separately.
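
For concreteness, the two schemes might look roughly like this (a sketch, assuming a trained model with this repo's forward(data, target, *mems) interface and a list encoded_sentences of 1-D LongTensors; chunk_len is an arbitrary example value):

    import torch

    chunk_len = 5  # example chunk length

    # Scheme A: bsz = 1 over the concatenation. The XL memory is carried
    # across chunks, so later sentences are conditioned on earlier ones.
    stream = torch.cat(encoded_sentences)
    mems = tuple()
    for i in range(0, stream.size(0) - 1, chunk_len):
        seq_len = min(chunk_len, stream.size(0) - 1 - i)
        inp = stream[i:i + seq_len].unsqueeze(1)          # seq-first: [seq_len, bsz=1]
        tgt = stream[i + 1:i + 1 + seq_len].unsqueeze(1)
        ret = model(inp, tgt, *mems)                      # returns [loss] + new memories
        loss, mems = ret[0], ret[1:]

    # Scheme B: one sentence at a time, each scored from an empty context.
    for sent in encoded_sentences:
        ret = model(sent[:-1].unsqueeze(1), sent[1:].unsqueeze(1))
        loss = ret[0]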

sruteesh-pivot commented 5 years ago

Here is my modified encoding function:

    def encode_text_batch(self, sentences, ordered=False, verbose=False, add_eos=True,
                          add_double_eos=False):
        # Tokenize each sentence and convert it to a LongTensor of ids.
        encoded = []
        for idx, line in enumerate(sentences):
            if verbose and idx > 0 and idx % 500000 == 0:
                print('    line {}'.format(idx))
            symbols = self.tokenize(line, add_eos=add_eos,
                                    add_double_eos=add_double_eos)
            encoded.append(self.convert_to_tensor(symbols))

        # ordered=True concatenates all sentences into a single stream,
        # mirroring what encode_file does for a whole file.
        if ordered:
            encoded = torch.cat(encoded)

        return encoded

    batch_sentences = ["this is a test", "this is a test", "this is a test"]
    encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
    tmp_iter = LMShuffledIterator(encoded_text_batch, 1, 5, device=device)
    evaluate(tmp_iter)

I get a different output for each of the sentences. Here is the output:

    1, ppl, loss : 16906.905848100676 48.67738723754883
    2, ppl, loss : 16927.99263942421 48.68361949920654
    3, ppl, loss : 16954.343652297874 48.691396713256836

Issue: Why am I getting different values for the same sentences? When I loop through the sentences one by one, the value is always equal to that of sentence 1.

Also, how do I get ppl values for a batch of sentences of different lengths? I understand that the model concatenates the sentences and predicts over fixed-length chunks, but this is not what I'm looking for. One solution is to get ppl values by looping through the sentences one by one. Is there any faster way?

Note: The model is a sample model trained for a few hundred batches on the lm1b dataset.

zihangdai commented 5 years ago

For the first question about "why different values for the same sentences", the answer is simply that each sentence has a different history (context) => different cache for XL.
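
A quick way to see this in code (a sketch, assuming a model built with mem_len > 0 and the encode_text_batch helper above):

    sent = corpus.vocab.encode_text_batch(["this is a test"], add_double_eos=True)[0]
    inp, tgt = sent[:-1].unsqueeze(1), sent[1:].unsqueeze(1)  # seq-first: [seq_len, bsz=1]

    mems = tuple()
    for i in range(3):
        ret = model(inp, tgt, *mems)   # forward returns [loss] + new memories
        loss, mems = ret[0], ret[1:]   # carry the XL cache into the next call
        print(i, loss.mean().item())   # changes call to call: the cache is the context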

If you want to get the ppl value of each separate sentence in a batch, you need two things: (1) set mem_len = 0, and (2) use a different data iterator. A rough example would look like:

    import math

    batch_sentences = ["this is a test", "this is a test", "this is a test"]
    encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
    for sent in encoded_text_batch:
        inp = sent[:-1].unsqueeze(1)   # the model expects seq-first input: [seq_len, bsz]
        tgt = sent[1:].unsqueeze(1)
        loss, = model(inp, tgt)        # with mem_len = 0 the forward pass returns only the loss
        ppl = math.exp(loss.mean().item())
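
With mem_len = 0 the model keeps no cache between calls, so every sentence is scored against an empty context and identical sentences get identical ppl. The unsqueeze(1) reflects the repo's seq-first batching ([seq_len, bsz], as in the training loop).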

sruteesh-pivot commented 5 years ago

Thanks @zihangdai for the responses. I am getting ppl = nan when I use mem_len = 0.

This is what I'm doing to get the ppl value of a single sentence of any length:

    def get_ppl_sent(sentence):
        encoded_text = corpus.vocab.encode_text(sentence, ordered=False, add_double_eos=True)
        # bptt = sentence length - 1, so the whole sentence fits in a single chunk
        tmp_iter = LMShuffledIterator(encoded_text, 1, len(encoded_text[0]) - 1, device=device)
        ppl, loss = evaluate(tmp_iter)
        return ppl

I am trying to use a neural LM in DeepSpeech's speech-to-text project in place of the KenLM-based n-gram model, and hence I am trying to integrate this LM into the beam-search decoder. Currently I'm able to get the probability of a sentence in ~25 ms, which becomes a bottleneck when trying to score 100 sentences (beam_width = 100). Hence I would like to score the sentences in batches. Let me know how I can get the probabilities of a batch of sentences in one go (similar to model.predict_batch(batch_sentences)).
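
One possible way to score a whole beam in a single forward pass (a sketch, not code from this repo: it assumes mem_len = 0, the repo's seq-first [seq_len, bsz] inputs, a per-token loss of shape [tgt_len, bsz] coming back from the model, and a hypothetical pad_id that never occurs as a real token) is to right-pad the batch and mask the loss:

    import torch

    def score_batch(model, sentences, pad_id, device='cpu'):
        # Right-pad a list of 1-D id tensors into one seq-first batch: [max_len, bsz].
        max_len = max(len(s) for s in sentences)
        batch = torch.full((max_len, len(sentences)), pad_id, dtype=torch.long)
        for i, s in enumerate(sentences):
            batch[:len(s), i] = s
        batch = batch.to(device)

        inp, tgt = batch[:-1], batch[1:]
        with torch.no_grad():
            loss = model(inp, tgt)[0]                    # per-token NLL, [max_len - 1, bsz]

        # Mask out padded target positions (assumes pad_id is never a real token).
        mask = (tgt != pad_id).float()
        sent_nll = (loss * mask).sum(0) / mask.sum(0)    # mean NLL per sentence
        return torch.exp(sent_nll)                       # ppl per sentence

Because the padding sits at the end and the attention is causal, real tokens never attend to pad positions, so the masked per-sentence losses should match one-by-one scoring with an empty memory.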

JainAbhilash commented 5 years ago

Were you able to solve the nan issue?