karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Incorrect description of sampling temperature in Docs #124

Open wrapperband opened 9 years ago

wrapperband commented 9 years ago

In the Readme.md it says:

Temperature. An important parameter you may want to play with is -temperature, which takes a number in range (0, 1] (0 not included), default = 1. The temperature is dividing the predicted log probabilities before the Softmax, so lower temperature will cause the model to make more likely, but also more boring and conservative predictions. Higher temperatures cause the model to take more chances and increase diversity of results, but at a cost of more mistakes.

I've double-checked my sampling a few times and I think it works the other way round: 1 = as accurate as it can be, 0.1 = very random. I haven't checked the code and keep thinking I must be wrong, so I'd like to confirm. I think it should read:

Temperature. An important parameter you may want to play with is -temperature, which takes a number in range (0, 1] (0 not included), default = 1. The temperature is dividing the predicted log probabilities before the Softmax, so lower temperature causes the model to take more chances and increase diversity of results, but at a cost of more mistakes. Higher temperatures cause the model to make more likely predictions, but produce more boring and conservative results.

davidepatti commented 9 years ago

My experience confirms the author's description, i.e. low = conservative, boring; high = experimental, many errors. Nevertheless, maybe 0.5 could be a better default value.

wrapperband commented 9 years ago

That is weird; at 0.1 it forgets words when I test it, and 1 is definitely the most like the original.

I've also tested it at various temperatures (1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, and also -1) over a number of runs and seen it get more random as the temperature decreases (you get an error at > 1 or < 0).

And yes, why set 1 as the default, if that gives the most random results?

hughperkins commented 9 years ago

I think this was covered in the blog post http://karpathy.github.io/2015/05/21/rnn-effectiveness/:

"Temperature. We can also play with the temperature of the Softmax during sampling. Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at cost of more mistakes (e.g. spelling mistakes, etc). In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say:

is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same

"looks like we've reached an infinite loop about startups."

wrapperband commented 9 years ago

Hi Hugh, thanks for all your great work... but, from the ABC experiment: https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folk-music-generation/

Clearly, the time signature should not be 9/8, but 6/8. The abc2midi tool gracefully fails, and fills in what was missing. Anyhow, most of the output of the RNN begins with the preface material, and ends with the music. Increasing the temperature beyond 1, or decreasing it below about 0.45 produces a lot of gibberish though.

hughperkins commented 9 years ago

https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folk-music-generation/

That's really cool :-)

Increasing the temperature beyond 1, or decreasing it below about 0.45 produces a lot of gibberish though.

Interesting.

wrapperband commented 9 years ago

When I get my R9 290 working (motherboard too old?) I'm continuing with songster, which I've trained on song tabs. Lots of work pre-processing though.

I got that link from somewhere; you might not have seen this either:
https://soundcloud.com/seaandsailor/sets/char-rnn-composes-irish-folk-music

It's from the references: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

hughperkins commented 9 years ago

So, the actual way the code works is in sample.lua:

-- use sampling
prediction:div(opt.temperature) -- scale by temperature
local probs = torch.exp(prediction):squeeze()
probs:div(torch.sum(probs)) -- renormalize so probs sum to one
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()

So, what it does is: divide the predicted log probabilities by the temperature T, exponentiate, and renormalize, i.e.

p_i = exp(x_i / T) / sum_j exp(x_j / T)

Having done all the maths, I suddenly realize that this is just a softmax function :-P https://en.wikipedia.org/wiki/Softmax_function
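
To make the direction of the effect concrete, here is a small standalone sketch of softmax with temperature (this is not code from char-rnn itself; the three logit values are made up):

    -- standalone sketch of softmax with temperature T (made-up logits, not char-rnn output)
    require 'torch'

    local function softmax_T(logits, T)
      local scaled = torch.div(logits, T)  -- scale by temperature, as sample.lua does
      scaled:add(-scaled:max())            -- subtract the max for numerical stability
      local probs = torch.exp(scaled)
      probs:div(torch.sum(probs))          -- renormalize so probs sum to one
      return probs
    end

    local logits = torch.Tensor({2.0, 1.0, 0.1})
    print(softmax_T(logits, 1.0))  -- roughly {0.66, 0.24, 0.10}: the unmodified distribution
    print(softmax_T(logits, 0.5))  -- roughly {0.86, 0.12, 0.02}: sharper, more conservative
    print(softmax_T(logits, 2.0))  -- roughly {0.50, 0.30, 0.19}: flatter, more diverse

So dividing by a temperature below 1 makes the most likely character even more likely (conservative, boring), while a temperature above 1 flattens the distribution (more diverse, more mistakes), which matches the Readme and the blog post.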

But note that normally, I think, one would take the max of the values and subtract it from each of them before doing the exp, for numerical stability; otherwise the normalization step involves dividing one extreme value by another extreme value. Examples of subtracting the max first include https://github.com/hughperkins/clnn/blob/master/SoftMax.cl#L35-L55 , or actually Karpathy's own convnetjs https://github.com/karpathy/convnetjs/blob/master/src/convnet_layers_loss.js#L31-L51 :

      // compute max activation
      var as = V.w;
      var amax = V.w[0];
      for(var i=1;i<this.out_depth;i++) {
        if(as[i] > amax) amax = as[i];
      }

      // compute exponentials (carefully to not blow up)
      var es = global.zeros(this.out_depth);
      var esum = 0.0;
      for(var i=0;i<this.out_depth;i++) {
        var e = Math.exp(as[i] - amax);
        esum += e;
        es[i] = e;
      }

      // normalize and output to sum to one
      for(var i=0;i<this.out_depth;i++) {
        es[i] /= esum;
        A.w[i] = es[i];
      }

I'm not sure if numerical stability is an issue in this specific case here, but it's not impossible that as T becomes smaller, the instabilities become more obvious, and if we did subtract the max, the observed behavior might be slightly different.
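
As a rough illustration of what could go wrong without the max subtraction (the scores below are made up and much larger than anything char-rnn would actually produce, just to force the overflow):

    -- made-up scores to force overflow; real char-rnn predictions are log probabilities
    require 'torch'

    local prediction = torch.Tensor({800, 795, 700})
    local T = 0.5

    -- naive version: exp() of the scaled values overflows, so the
    -- renormalization becomes inf/inf
    local naive = torch.exp(torch.div(prediction, T))
    print(naive)  -- inf inf inf

    -- stabilized version: subtract the max before exponentiating
    local shifted = torch.div(prediction, T)
    shifted:add(-shifted:max())
    local probs = torch.exp(shifted)
    probs:div(torch.sum(probs))
    print(probs)  -- roughly {1.0, 4.5e-05, 0.0}: finite and well-defined

If I remember right, the predictions in sample.lua come out of a LogSoftMax, so they are negative and the more likely failure mode would be underflow towards zero rather than overflow, but subtracting the max handles both cases.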

wrapperband commented 9 years ago

Does this mean there are edge cases, or that I was correct in my tests of the current temperature behaviour?

Or is there a multiple-input bug causing my checkpoints to act differently?

hughperkins commented 9 years ago

Unclear at this time. I've submitted an issue to see what Karpathy's opinion on subtracting the max first is: https://github.com/karpathy/char-rnn/issues/133

Or you could try subtracting the max. Basically, find the code cited above in sample.lua, and in between the div and the exp, take the max of prediction and subtract it from prediction. I'm fairly sure this isn't a mini-batch (but you might want to double-check this point), so it might be a fairly straightforward change, i.e. it might be simply prediction:csub(prediction:max()), or something similar to this. (I'm not 100% certain my maths is correct, but it seems correct-ish :-P )
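
For concreteness, the change being suggested would look something like this (untested sketch, based on the snippet quoted above; prediction:add(-prediction:max()) is just another way of writing the csub):

    -- use sampling
    prediction:div(opt.temperature)      -- scale by temperature
    prediction:add(-prediction:max())    -- proposed: subtract the max before exponentiating
    local probs = torch.exp(prediction):squeeze()
    probs:div(torch.sum(probs))          -- renormalize so probs sum to one
    prev_char = torch.multinomial(probs:float(), 1):resize(1):float()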

wrapperband commented 9 years ago

There are a couple of obvious weak areas fixed that look like they were causing me problems. I'll carry on getting my new PC together, and test them together.

hughperkins commented 9 years ago

:-)