karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Subsequence Probability #149

Open dedcode opened 8 years ago

dedcode commented 8 years ago

Hi, I am using char-rnn to sample only a small number of characters (e.g., -length 20) given some seed text. Is there a way to compute the probability with which a sub-sequence was generated, out of all the other options at each character? My goal is to compute a confidence score for a generated word. Thanks!

FragLegs commented 8 years ago

Take a look at my pull request: https://github.com/karpathy/char-rnn/pull/151

vinhqdang commented 7 years ago

Hi,

Thanks for your answer @FragLegs, but I am not sure how I should use your code.

Let's say I have a trained model, my text so far is "abcd", and I want to predict the next character, with output like:

a: 0.4
b: 0.1
c: 0.2
d: 0.3

where each number is the probability that the corresponding character will appear as the 5th character.

FragLegs commented 7 years ago

Hi @vinhqdang. My pull request is designed to do something slightly different. You can use it to do what you are trying to accomplish, but you might be better served by editing the code yourself.
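For what it's worth, if you do want the full distribution over the next character, a minimal sketch of such an edit (assuming the stock sample.lua, where prediction holds the log-probabilities after the priming loop and ivocab maps vocab indices back to characters) would be to print the distribution right after the prime text has been fed in:

```lua
-- sketch: add after the priming loop in sample.lua.
-- `prediction` holds the log-probabilities over the next character
-- (the network ends in LogSoftMax); `ivocab` maps indices to characters.
local probs = torch.exp(prediction):squeeze()  -- log-probs -> probs
probs:div(torch.sum(probs))                    -- renormalize, just in case
for i = 1, probs:size(1) do
    print(string.format('%s: %.4f', ivocab[i], probs[i]))
end
```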

My PR is intended to give the probability of a string of characters (both the seed and the characters generated by the rnn). So, let's say you want the (log) probability of "abcda". You can get that via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcda" -length 0

Similarly, for "abcdb" you can call th sample.lua cv/my_checkpointed_model.t7 -primetext "abcdb" -length 0 and so on.

In order to determine the probability of each of those characters in the 5th position, you'll also need to know the probability of the 4 leading characters via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcd" -length 0

For a language model such as this one, the chain rule gives P(c_0, c_1, ..., c_{n-1}, c) = P(c | c_0, c_1, ..., c_{n-1}) * P(c_0, c_1, ..., c_{n-1}). So, to get the probability of character c given the preceding characters c_0, c_1, ..., c_{n-1}, simply divide P(c_0, c_1, ..., c_{n-1}, c) by P(c_0, c_1, ..., c_{n-1}). To make that more concrete, in your example above:

a: P(abcda) / P(abcd)
b: P(abcdb) / P(abcd)
c: P(abcdc) / P(abcd)
d: P(abcdd) / P(abcd)

Since my script outputs log probabilities, simply subtract the value you get via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcd" -length 0 from the value you get via th sample.lua cv/my_checkpointed_model.t7 -primetext "abcda" -length 0 to get the log probability of "a" given "abcd".
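To put numbers on that (the values below are made up purely for illustration; read the real ones off your two sample.lua runs), the arithmetic would look like:

```lua
-- toy log-probabilities, standing in for the two sample.lua outputs
local logp_abcda = -7.2   -- log P(abcda), from: -primetext "abcda" -length 0
local logp_abcd  = -5.8   -- log P(abcd),  from: -primetext "abcd"  -length 0

-- chain rule in log space: log P(a | abcd) = log P(abcda) - log P(abcd)
local logp_a_given_abcd = logp_abcda - logp_abcd
print(string.format('log P(a | abcd) = %.2f, P(a | abcd) = %.4f',
                     logp_a_given_abcd, math.exp(logp_a_given_abcd)))
```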