Open · wrapperband opened this issue 9 years ago
My experience confirms the author's description, i.e. low = conservative, boring; high = experimental, many errors. Nevertheless, maybe 0.5 would be a better default value.
That is weird, at 0.1 it forgets words when I test it. 1 is definitely the most like the original.
I've also tested it at various temperatures (1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, for a number of runs, and also -1) and seen it get more random as the temperature decreases. (You get an error at > 1 or < 0.)
And yes, why set 1 as the default, if that gives the most random results?
This was covered in the blog post, I think (http://karpathy.github.io/2015/05/21/rnn-effectiveness/):
"Temperature. We can also play with the temperature of the Softmax during sampling. Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at cost of more mistakes (e.g. spelling mistakes, etc). In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say:
is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same
"looks like we've reached an infinite loop about startups."
Hi Hugh, thanks for all your great work.... but. From the ABC experiment: https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folk-music-generation/
Clearly, the time signature should not be 9/8, but 6/8. The abc2midi tool fails gracefully and fills in what was missing. Anyhow, most of the output of the RNN begins with the preface material and ends with the music. Increasing the temperature beyond 1, or decreasing it below about 0.45, produces a lot of gibberish though.
That's really cool :-)
"Increasing the temperature beyond 1, or decreasing it below about 0.45 produces a lot of gibberish though."
Interesting.
When I get my R9 290 working (motherboard too old?) I'm continuing with songster, which I've trained on song tabs. A lot of work preprocessing though.
I got that link from somewhere; you might not have seen this either:
https://soundcloud.com/seaandsailor/sets/char-rnn-composes-irish-folk-music
It's from the references: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
So, the actual way the code works is in sample.lua:
-- use sampling
prediction:div(opt.temperature) -- scale by temperature
local probs = torch.exp(prediction):squeeze()
probs:div(torch.sum(probs)) -- renormalize so probs sum to one
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()
So, what it does is: divide the predicted log probabilities by the temperature, exponentiate, renormalize so the probabilities sum to one, and then sample the next character from that distribution.
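As a toy illustration of the temperature division (not from the repo; the numbers and the sample_probs helper are made up, but it uses the same Torch calls as sample.lua):
-- made-up log probabilities for a three-character vocabulary
local prediction = torch.Tensor({-1, -2, -3})
local function sample_probs(logprobs, temperature)
  local p = logprobs:clone()
  p:div(temperature)          -- scale by temperature
  local probs = torch.exp(p)
  probs:div(torch.sum(probs)) -- renormalize so probs sum to one
  return probs
end
print(sample_probs(prediction, 1.0)) -- roughly 0.67, 0.24, 0.09
print(sample_probs(prediction, 0.5)) -- roughly 0.87, 0.12, 0.02 (sharper, more conservative)
So dividing by a temperature below 1 concentrates probability mass on the already-likely characters, and a temperature above 1 flattens the distribution.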
Having done all the maths, I suddenly realize that this is just a softmax function :-P https://en.wikipedia.org/wiki/Softmax_function
But note that normally, I think, one would take the max of the values and subtract it from each of them before doing the exp, for numerical stability; otherwise the normalization step involves dividing one extreme value by another extreme value. Examples of subtracting the max first include https://github.com/hughperkins/clnn/blob/master/SoftMax.cl#L35-L55 , or actually Karpathy's own convnetjs https://github.com/karpathy/convnetjs/blob/master/src/convnet_layers_loss.js#L31-L51 :
// compute max activation
var as = V.w;
var amax = V.w[0];
for(var i=1;i<this.out_depth;i++) {
  if(as[i] > amax) amax = as[i];
}
// compute exponentials (carefully to not blow up)
var es = global.zeros(this.out_depth);
var esum = 0.0;
for(var i=0;i<this.out_depth;i++) {
  var e = Math.exp(as[i] - amax);
  esum += e;
  es[i] = e;
}
// normalize and output to sum to one
for(var i=0;i<this.out_depth;i++) {
  es[i] /= esum;
  A.w[i] = es[i];
}
I'm not sure if numerical stability is an issue in this specific case here, but it's not impossible that as T becomes smaller, the instabilities become more obvious, and if we did subtract the max, the observed behavior might be slightly different.
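For instance (a purely hypothetical sketch with made-up log probabilities, assuming single-precision tensors; I haven't observed this with a real checkpoint):
-- made-up log probabilities and a very small temperature, in single precision
local prediction = torch.FloatTensor({-5, -10})
local T = 0.02
-- without subtracting the max, exp(-250) and exp(-500) both underflow to zero,
-- so the normalization divides zero by zero and produces nan
local probs = torch.exp(prediction:clone():div(T))
probs:div(torch.sum(probs))
print(probs)  -- nan, nan
-- subtracting the max first makes the largest entry exp(0) = 1, so the sum is never zero
local scaled = prediction:clone():div(T)
scaled:csub(scaled:max()) -- scaled:add(-scaled:max()) is equivalent if csub isn't available
local probs2 = torch.exp(scaled)
probs2:div(torch.sum(probs2))
print(probs2) -- 1, 0
In double precision the breakdown needs more extreme values, but the same mechanism applies.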
Does this mean there are edge cases, or was I correct in my tests of the current temperature behaviour? Or is the multiple input bug causing my checkpoints to act differently?
Unclear at this time. I've submitted an issue to see what Karpathy's opinion on subtracting the max first is: https://github.com/karpathy/char-rnn/issues/133
Or you could try subtracting the max. Basically, find the code cited above in sample.lua, and in between the div and the exp, take the max of prediction and subtract it from prediction. I'm fairly sure this isn't a mini-batch (but you might want to double-check this point), so it might be a fairly straightforward change, i.e. might be simply prediction:csub(prediction:max()), or something similar to this. (I'm not 100% certain my maths is correct, but it seems correct-ish :-P )
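Concretely, the patched block in sample.lua might then look something like this (an untested sketch of the change described above, not a confirmed fix):
-- use sampling
prediction:div(opt.temperature) -- scale by temperature
prediction:csub(prediction:max()) -- subtract the max before exp, for numerical stability
local probs = torch.exp(prediction):squeeze()
probs:div(torch.sum(probs)) -- renormalize so probs sum to one
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()
Mathematically the subtraction cancels out in the renormalization, so the sampled distribution should be unchanged except where the unstabilized version would have overflowed or underflowed.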
There are a couple of obvious weak areas fixed that look like they were causing me problems. I'll carry on getting my new PC together, and test them together.
:-)
In the Readme.md it says:
Temperature. An important parameter you may want to play with is -temperature, which takes a number in range (0, 1] (0 not included), default = 1. The temperature is dividing the predicted log probabilities before the Softmax, so lower temperature will cause the model to make more likely, but also more boring and conservative predictions. Higher temperatures cause the model to take more chances and increase diversity of results, but at a cost of more mistakes.
I've double-checked my sampling a few times and I think it works the other way round, so 1 = as accurate as it can be, 0.1 = very random. I haven't checked the code and keep thinking I must be wrong, so I'd like to confirm. It should be:
Temperature. An important parameter you may want to play with is -temperature, which takes a number in range (0, 1] (0 not included), default = 1. The temperature is dividing the predicted log probabilities before the Softmax, so lower temperature causes the model to take more chances and increase diversity of results, but at a cost of more mistakes. Higher temperatures cause the model to make more likely predictions, but produce more boring and conservative results.