salesforce / awd-lstm-lm

LSTM and QRNN Language Model Toolkit for PyTorch
BSD 3-Clause "New" or "Revised" License

how to output sentence's probability? #96

Open OswaldoBornemann opened 5 years ago

OswaldoBornemann commented 5 years ago

May I ask how to use awd-lstm-lm to output a sentence's probability?

lorelupo commented 5 years ago

Same question here! @tsungruihon did you find a solution?

OswaldoBornemann commented 5 years ago

@WolfLo no, I haven't; it's still a work in progress.

lorelupo commented 5 years ago

@tsungruihon I calculate the likelihood of an input sentence by summing the log-probabilities output by the model for each word of the input sentence. It looks like this:

def score(self, sentence):
    # assumes `import torch` and `import torch.nn.functional as F` at module level
    tokens = text_utils.getTokens(sentence)
    idxs = [self.dictionary.getIndex(x) for x in tokens]
    idxs = torch.LongTensor(idxs)
    # make it look like a batch of one element
    input = batch_utils.batchifyCorpusTensor(idxs, 1)
    # instantiate hidden states
    hidden = self.model.initHidden(batchSize=1)
    output, hidden = self.model(input, hidden)
    logits = self.model.decoder(output)
    logProba = F.log_softmax(logits, dim=1)
    # sum the log-probability assigned to each token of the sentence
    return sum(logProba[i][idxs[i]] for i in range(len(idxs)))
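
In other words, the idea is to estimate the sentence log-likelihood via the chain rule, log P(w_1, ..., w_n) = sum_i log P(w_i | w_1, ..., w_{i-1}), reading each term off the model's log-softmax output.
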
OswaldoBornemann commented 5 years ago

@WolfLo thanks my friend! Nice work!

gailweiss commented 5 years ago

Hi, thanks @WolfLo! One thing is confusing me: does this also take into account the probability of the first token in the sentence? (i.e., the probability the model assigns to the first token when in the state given by model.initHidden?)

lorelupo commented 5 years ago

Hi @gailweiss, an approximation of the log probability of the first token in the sentence should be given by logProba[0][idxs[0]], right? However, I might have misunderstood your concern.

gailweiss commented 5 years ago

Hi @WolfLo, thanks for the quick response! I guess what I'm not clear on is:

isn't logProba[i] the (log) next-token distribution after step i? i.e. if the input is a a <eos>, isn't logProba[0] the (log) probability of each possible next token after seeing that initial a (and logProba[-1] the log probabilities after having seen all of "a a <eos>")?

more directly, isn't output[0] only the output of the model after processing the first input token?
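
To spell out the alignment I have in mind (just a sketch of the indices, not runnable code):

    # input tokens:    idxs[0]       idxs[1]       ...   idxs[n-1]
    # model outputs:   logProba[0]   logProba[1]   ...   logProba[n-1]
    # logProba[i] is the distribution over the token that *follows* idxs[i],
    # so it should be paired with idxs[i+1] rather than idxs[i]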

lorelupo commented 5 years ago

Oh, I see! That's an excellent remark. Then, I think you could rewrite the above scoring function as:

def score(self, sentence):
    # assumes `import torch` and `import torch.nn.functional as F` at module level
    tokens = text_utils.getTokens("<eos> " + sentence)  # <eos> here serves as <sos>
    idxs = [self.dictionary.getIndex(x) for x in tokens]
    idxs = torch.LongTensor(idxs)
    # make it look like a batch of one element
    input = batch_utils.batchifyCorpusTensor(idxs, 1)
    # instantiate hidden states
    hidden = self.model.initHidden(batchSize=1)
    output, hidden = self.model(input, hidden)
    logits = self.model.decoder(output)
    logProba = F.log_softmax(logits, dim=1)
    # logProba[i] is the distribution over the *next* token, so pair it with idxs[i+1]
    return sum(logProba[i][idxs[i + 1]] for i in range(len(idxs) - 1))
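
For example (a hypothetical usage sketch, assuming `lm` is an instance of the class holding the trained model and the dictionary):

    log_prob = lm.score("the cat sat on the mat")  # 0-dim tensor holding the sentence log-probability
    print(log_prob.item())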

What do you think?

gailweiss commented 5 years ago

This seems to make sense :) thank you for taking the time to get into this!

I assume/hope that, the way the models here are trained, one sequence begins after the <eos> of the previous, i.e. I hope that the training in this repository also trains the distribution after <eos>. But at any rate this is a consistent solution, and it's just a question of whether the model optimises appropriately, which is something else.

Thank you!

lorelupo commented 5 years ago

Indeed, training in this repo is performed over a long tensor representing the concatenation of all the sentences of the corpus, with the tag <eos> appended at the end of each sentence.
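
Conceptually, the corpus tensor is built roughly like this (a simplified sketch, not the exact code of data.py; `corpus_sentences`, `dictionary` and the `torch` import are assumed):

    ids = []
    for sentence in corpus_sentences:                 # raw sentences of the corpus
        for word in sentence.split():
            ids.append(dictionary.getIndex(word))
        ids.append(dictionary.getIndex("<eos>"))      # sentence boundary marker
    train_data = torch.LongTensor(ids)                # one long tensor over the whole corpus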

Thank you for pointing out this issue!

ishpiki commented 5 years ago

Hi @WolfLo, thanks for the code. I have one question related to next-word prediction: given a word and the previous hidden states, we could try to predict the next most probable word according to the softmax probability distribution. Did you try to do this with your function? When I did it with the trained model (default settings, wikitext-2 dataset), the result was not so good:

original: an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television

predicted: @-@ , and , . , also a appearance seller in the series , and the , was the by a in in the film , . by . who was released by the . the Academy of in was previously appearance in the film series
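
What I did is roughly the following (a simplified sketch, reusing the hypothetical helpers from the score() function above):

    # feed the original text and take the argmax word at every step
    idxs = torch.LongTensor([dictionary.getIndex(w) for w in original_text.split()])
    input = batch_utils.batchifyCorpusTensor(idxs, 1)
    hidden = model.initHidden(batchSize=1)
    output, hidden = model(input, hidden)
    logits = model.decoder(output)
    predicted = logits.argmax(dim=-1)                 # most probable next-word index at each position
    # (indices are then mapped back to words via the dictionary)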

Maybe you have faced this issue before?

Thanks.

lorelupo commented 5 years ago

I tried sentence generation some time ago with the awd-lstm model trained on wikitext-2, and the results were pretty poor for me too. You might improve generation quality by adjusting the sampling temperature (see the sketch below), by using tricks like beam search, or by training the model on bigger datasets. Unfortunately, I do not have time to dig further into this right now. Should I work on this in the future, I will let you know!
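
A rough sketch of temperature-scaled sampling (hypothetical standalone code, assuming the same `torch` / `F` imports as above; `logits` stands for the 1-D vector of vocabulary scores at the last generated position):

    temperature = 0.8                                           # < 1.0 sharpens the distribution, > 1.0 flattens it
    probs = F.softmax(logits / temperature, dim=0)              # temperature-scaled next-word distribution
    next_idx = torch.multinomial(probs, num_samples=1).item()   # sample the next word index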

Have a good day :)