sruteesh-pivot opened this issue 5 years ago
Could you explain the difference a little bit more? If I understand correctly, for bsz=1, you first concatenate all the sentences into a single sequence and evaluate on each fixed-length chunk of the concatenated sequence. When you "loop through the sentences one by one", you evaluate on each sentence separately.
Here is my modified encoding function
def encode_text_batch(self, sentences, ordered=False, verbose=False, add_eos=True,
                      add_double_eos=False):
    # Same as vocab.encode_file, but takes an in-memory list of sentences
    # instead of reading lines from a file.
    encoded = []
    for idx, line in enumerate(sentences):
        if verbose and idx > 0 and idx % 500000 == 0:
            print('    line {}'.format(idx))
        symbols = self.tokenize(line, add_eos=add_eos,
                                add_double_eos=add_double_eos)
        encoded.append(self.convert_to_tensor(symbols))

    if ordered:
        encoded = torch.cat(encoded)

    return encoded
batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
tmp_iter = LMShuffledIterator(encoded_text_batch, 1, 5, device=device)
evaluate(tmp_iter)
I get a different output for each of the sentences. Here is the output:
1, ppl, loss : 16906.905848100676 48.67738723754883
2, ppl, loss : 16927.99263942421 48.68361949920654
3, ppl, loss : 16954.343652297874 48.691396713256836
Issue: Why am I getting different values for the same sentences? The value is always equal to that of sentence 1 when I loop through the sentences one by one.
Also, how do I get ppl values for a batch of sentences of different lengths? I understand that the model concatenates the sentences and predicts on fixed-length chunks, but this is not what I'm looking for. One solution is to get the ppl values by looping through the sentences one by one. Is there a faster way?
Note: the model is a sample model trained for a few hundred batches on the lm1b dataset.
For the first question about "why different values for the same sentences", the answer is simply that each sentence has a different history (context) => a different cache for XL.
If you want to get the ppl value of each separate sentence in a batch, you need two things: (1) set mem_len = 0, and (2) use a different data iterator. A rough example would look like:
batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded_text_batch = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)
for sent in encoded_text_batch:
    inp = sent[:-1]
    tgt = sent[1:]
    loss, = model(inp, tgt)            # with mem_len = 0 the model returns only the per-token loss
    ppl = math.exp(loss.mean().item())
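In case it helps, here is a slightly more complete sketch of that loop. Treat it as an assumption rather than the reference way: it supposes the checkpoint is loaded the same way eval.py loads it, that the memory is disabled via model.reset_length, and that the model consumes [seq_len, batch] shaped tensors, so each 1-D sentence tensor is unsqueezed into a batch of one.

import math
import torch

# Sketch only -- assumes `model`, `corpus` and `device` exist as earlier in this thread.
model = model.to(device)
model.eval()
# Disable the recurrence memory so every sentence is scored without history.
# Signature is reset_length(tgt_len, ext_len, mem_len); the tgt_len value here is arbitrary.
model.reset_length(128, 0, 0)

batch_sentences = ["this is a test", "this is a test", "this is a test"]
encoded = corpus.vocab.encode_text_batch(batch_sentences, ordered=False, add_double_eos=True)

with torch.no_grad():
    for sent in encoded:
        inp = sent[:-1].unsqueeze(1).to(device)    # [seq_len - 1, 1]
        tgt = sent[1:].unsqueeze(1).to(device)
        loss, *_ = model(inp, tgt)                 # per-token NLL, shape [seq_len - 1, 1]
        print(math.exp(loss.mean().item()))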
Thanks @zihangdai for the responses.
I am getting ppl = nan when I use mem_len = 0.
This is what I'm doing to get the ppl value of a single sentence of any length:
def get_ppl_sent(sentence):
    encoded_text = corpus.vocab.encode_text(sentence, ordered=False, add_double_eos=True)
    tmp_iter = LMShuffledIterator(encoded_text, 1, len(encoded_text[0]) - 1, device=device)
    ppl, loss = evaluate(tmp_iter)
    return ppl
I am trying to use a neural LM in DeepSpeech's speech-to-text project in place of the KenLM n-gram model, and hence I am trying to integrate this LM into the beam-search decoder.
Currently I'm able to get the probability of a sentence in ~25 ms, which becomes the bottleneck when trying to score 100 sentences (beam_width = 100). Hence I would like to score the sentences in batches.
Let me know how I can get the probabilities of a batch of sentences in one go (similar to model.predict_batch(batch_sentences)).
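One way this could be done, given mem_len = 0, is to right-pad the beam to a common length and mask the padded positions out of the loss; since the attention is causal, right-padding cannot influence the real tokens. This is only a rough sketch: score_batch and pad_idx are illustrative names, and it assumes the model returns the unreduced per-token loss with shape [seq_len, batch].

import math
import torch

def score_batch(model, encoded_sents, device, pad_idx=0):
    # Score every sentence of a beam in a single forward pass.
    # Assumes mem_len = 0 and [seq_len, batch] shaped inputs.
    max_len = max(len(s) for s in encoded_sents)
    bsz = len(encoded_sents)

    data = torch.full((max_len - 1, bsz), pad_idx, dtype=torch.long)
    target = torch.full((max_len - 1, bsz), pad_idx, dtype=torch.long)
    mask = torch.zeros(max_len - 1, bsz)

    for i, sent in enumerate(encoded_sents):
        n = len(sent) - 1                      # number of predicted positions
        data[:n, i] = sent[:-1]
        target[:n, i] = sent[1:]
        mask[:n, i] = 1.0

    data, target, mask = data.to(device), target.to(device), mask.to(device)

    with torch.no_grad():
        loss, *_ = model(data, target)             # per-token NLL, [max_len - 1, bsz]
        nll = (loss * mask).sum(0) / mask.sum(0)   # mean NLL per sentence

    return [math.exp(x) for x in nll.tolist()]

ppls = score_batch(model, encoded_text_batch, device)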
Were you able to solve the issue of nan?
Hi, I observed the values to be slightly different when evaluating the perplexity of a set of sentences with batch_size = 1 vs looping through the sentences one by one (all other parameters being the same): the loss is 0.7707 in one case vs 0.7564 in the other.
I created the data iterator using dataset="lm1b".
Note: I modified corpus.vocab.encode_file to encode the input sentences instead of reading from a file. Any particular reason why this is observed?
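Presumably this is the same history/cache effect described above: with bsz = 1 the shuffled iterator streams all sentences into one sequence and yields fixed-length bptt chunks, so chunk boundaries do not line up with sentence boundaries and the memory carries context across sentences, while looping one by one starts each sentence from an empty context. A toy check along these lines (hypothetical data, reusing encode_text_batch and the other names from earlier in the thread):

sents = corpus.vocab.encode_text_batch(["this is a test"] * 3, ordered=False, add_double_eos=True)
it = LMShuffledIterator(sents, 1, 5, device=device)
for data, target, seq_len in it:
    # Each yielded chunk is 5 tokens long regardless of where a sentence ends.
    print(data.squeeze(1).tolist())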