Open gwern opened 8 years ago
I recently implemented beam search for an RNN Language Model in context of Image Captioning in NeuralTalk2 repo (https://github.com/karpathy/neuraltalk2/blob/master/misc/LanguageModel.lua#L166). It does makes things work quite a bit better there. I should try to port the option to char-rnn. Thanks!
If you've implemented it before, then even better! I'll be curious to see if beam search does fix my CSS woes, and if it helps out with another char-rnn
thing I've been messing with.
I wrote a beam search implementation on top of char-rnn some time ago. See sample-beam.lua here: https://github.com/pender/char-rnn/blob/master/sample-beam.lua . A beam width of 2 on a fully trained model eliminates effectively all of the typos even at temperature of 1; I've set this as the default and recommend it, because with a wider beam, the generated text is noticeably conservative in odd ways. I hope this is useful!
pender: I've given it a shot. It seems to crash a lot at randomly with odd errors:
$ th sample-beam.lua cv/lm_lstm_epoch1.09_0.7607.t7 -temperature 0.4 -length 500 -seed 120 -beam 10 -primetext 'M|BIBLE|'
using CUDA on GPU 0...
Make sure that your saved checkpoint was also trained with GPU.
If it was trained with CPU use -gpuid -1 for sampling as well.
Error: File cv/lm_lstm_epoch1.09_0.7607.t7 does not exist. Are you sure you didn't forget to prepend cv/ ?
/home/gwern/src/torch/install/bin/luajit: cannot open <cv/lm_lstm_epoch1.09_0.7607.t7> in mode r at /home/gwern/src/torch/pkg/torch/lib/TH/THDiskFile.c:484
stack traceback:
[C]: at 0x7f05881e2cb0
[C]: in function 'DiskFile'
/home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:303: in function 'load'
sample-beam.lua:76: in main chunk
[C]: in function 'dofile'
.../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670
$ th sample-beam.lua cv/jordan-1-vl0.9638.t7 -temperature 0.5 -length 500 -seed 120 -beam 100 -primetext "The Dragon Reborn "
using CUDA on GPU 0...
Make sure that your saved checkpoint was also trained with GPU.
If it was trained with CPU use -gpuid -1 for sampling as well.
creating an lstm...
seeding with The Dragon Reborn
--------------------------
/home/gwern/src/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /home/gwern/src/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109)
stack traceback:
[C]: at 0x7f1103550bc0
[C]: in function 'multinomial'
sample-beam.lua:242: in main chunk
[C]: in function 'dofile'
.../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
[C]: at 0x00406670
The Dragon Reborn 1
Lower temperatures & higher beam widths seem more likely to trigger the multinomial error - beam widths of 50+ hardly ever work (some sort of numerical issue?).
Anyway, when it does work, I see what you mean about being "noticeably conservative". Example:
$ th sample-beam.lua cv/jordanbible-vl0.9763.t7 -temperature 0.7 -length 500 -seed 120 -beam 25 -primetext "BIBLE|"
using CUDA on GPU 0...
Make sure that your saved checkpoint was also trained with GPU.
If it was trained with CPU use -gpuid -1 for sampling as well.
creating an lstm...
seeding with BIBLE|
--------------------------
BIBLE|from the word of the Lord God of Israel. 06:017:017 And the LORD spake unto the children of Israel, saying, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel,
Sample completed in 77.91 seconds.
Different RNN, beam of 11, fixed seed and primetext "The Dragon Reborn ":
The Dragon Reborn ing the Dragon Reborn, and the Dragon Reborn and the Dragon Reborn with the Dragon Reborn, and the Dragon Reborn and the Dragon Reborn in Cairhien. The Dragon Reborn and the Dark One of the Light, and the Dragon Reborn and the Dragon Reborn and the Dragon Reborn and the Dragon Reborn, and the Dragon Reborn and the Great Lord of the Dark, and the Dragon Reborn and the Dragon Reborn
No prime:
Rand and the others. There was nothing to do with the Power, but there was nothing to be done about it. There was nothing to be done about it, and there was nothing to be done about it. There was nothing to be done about it, and there was nothing to be done about it. There was nothing to be done about it, but there was nothing to be done about it. There was nothing to be done about it, but there was nothing to be done about it. There was nothing to be done about it, but there was only one way to
Even an extremely wide beam of 50 doesn't help with this issue:
$ th sample-beam.lua cv/jordan-1-vl0.9638.t7 -temperature 0.5 -length 500 -seed 120 -beam 50 -primetext "The Dragon Reborn "
using CUDA on GPU 0...
Make sure that your saved checkpoint was also trained with GPU.
If it was trained with CPU use -gpuid -1 for sampling as well.
creating an lstm...
seeding with The Dragon Reborn
--------------------------
The Dragon Reborn or no Aes Sedai, the Amyrlin Seat and the Aes Sedai who had been the Amyrlin Seat. The Amyrlin Seat and the Amyrlin Seat of the White Tower, and the Amyrlin Seat of the Hall of the Tower. The Amyrlin Seat and the Amyrlin Seat of the White Tower. The Amyrlin Seat and the Amyrlin Seat of the White Tower, and the Amyrlin Seat, and the Amyrlin Seat of the Hall of the Tower, and the Amyrlin Seat of the Hall of the Tower. The Amyrlin Seat and the Amyrlin Seat of the Hall of the Tower, and the Amyrlin
Sample completed in 206.62 seconds.
Same prime, but beam of 2:
The Dragon Reborn or no Aes Sedai. The White Tower was a surprising manner, and they were still the ones who had been sent to the Tower of Ghenjei. The Amyrlin Seat had been a Darkfriend, and they were all the world in the Two Rivers, and they were all the Aes Sedai who had been the Dragon Reborn and the Dragon Reborn and the Dragon Reborn. The Shaido were all the way to the city, and they were the only ones who had been the ones who could channel. The men who could channel were still a little stronger than the m
(Kind of odd. I had been expecting b>10 to be a waste of computation, not make results worse.)
A little more systematically looking at beam widths 1-20 using a char-rnn I trained on my website & IRC logs a while back, with the shell command: for b in {1..20}; do th sample-beam.lua cv/gweRRN/lm_lstm_epoch15.36_1.1298.t7 -temperature 1 -length 500 -seed 120 -primetext "A " -beam $b; done; alert
. (And one final run with b=100.) Stripping out the boilerplate and collapsing each to a single line:
I don't see any obvious quality trend other than b=1 is the worst, and it's interesting that hyperlinks only begin to show up with higher beam widths.
Going back to my CSS RNN, using high beam widths triggers the same repetition issue. What about my data URI problem? I tried forcing it to manifest by providing the data URI preamble as a prime text, which worked, since even 10k character is not enough to break out of the URI trap:
$ th sample-beam.lua cv/css.t7 -temperature 1 -length 10000 -seed 120 -primetext "url('data:image/png;base64" -beam 1
...url('
...Wl5ruQLwidE6XZTLFjPfX0qhcUy2yDFCJicv1EuIclc4NtEdnBISR7VmEn/AOquTC5Zn5Bj8EOd1kUu8e9r2iQNZJqRH90HzZnzkMAARuAbiCwqe/Pq8/tL99y816Eyj5FccM0BqJoFk6NtdJgb57edln2KavEInVMwffNYqJIyLycbpprdrHbj5mQnqXf/N8i2D2BYYXquAv2m3xuSPz0emR7bXJoeACEJ3wg0CInQUSn5GgqB2x-wc2h4yDMsUyYVJcV6LTda5BKLWdNsEthlJ968UgCfYK832uuUKAGQEGqrkIgEouP6Un5ZBeZjLNtYsoGjOJWMoGrT4vMkugcofDjjMohnAwrWo7NEL
Sample completed in 13.21 seconds.
I did another loop generating 5k characters to see if at any particular beam width it'd break out.
Around b=6 it falls into an even worse trap, where after starting like url('
it just repeats 'A' endlessly. (It doesn't fall into it for all beam widths >=6 - eg b=19 looked like regular gibberish - but most of them. Watching generation, there's also a curious speed change: it seems to pause for a very long time after a few initial AAAs and then suddenly emit all the rest of the As up to the character limit.)
As of b=31, generating 5k characters takes 526s and I had observed no breakouts up to this point, so I killed the run.
However, this might reflect the idiosyncratic nature of my trained RNNs with potential overfitting etc since I'm not very good at this. If someone wants to try it out on other datasets like Tiny Shakespeare and see if there's similar catastrophic repetition at higher beam searches, that would be interesting. (Could pender's implementation be wrong? I still find it weird that more beam search makes things much worse.)
So to summarize:
I think it's recognized theoretically that widening the beam can result in an inferior result, and I seem to recall that the papers that mention beam search in the context of sequence prediction nets generally use a beam width of 2, or another low value. But in this case my intuition is that a sequence prediction net trained to predict the next single token will have stretches when it is very confident about the next token (e.g. when reciting a really long word, or after seeing the first three characters of the string "http://www."), and occasions when it is much less confident (e.g. when starting the first word of a new sentence). The wider your beam, the harder you're selecting for sequences that contain long stretches of predictable characters. So you end up in a loop of a particularly predictable string of text, stitched together in a location where it wasn't confident about what would come next anyway. Whatever minimal probability penalty it takes by looping back to the start of the phrase when it was in a toss-up situation anyway is outweighed by the gains of being so confident about the rest of the phrase. If I've got this right, then the flaw is in the net rather than the search algorithm, and if you searched exhaustively and found the sequence of a particular length with the globally highest cumulative probability, it would look similar.
The first crash that you noted above seems to be the result of a misspecified network name, and I suspect that the second is an issue that occurs when your beam is wider than the number of characters in your net's vocabulary. (Given the poor results of wide beams, I haven't done much testing of beam widths greater than 10, nor any for beams widths greater than 50.) I just pushed a fix to the latter.
Thanks again to pender for the excellent 'sample-beam.lua' which greatly improves sampling results with char-rnn.
I've also been testing this word level version: https://github.com/larspars/word-rnn
Is it possible to modify 'sample-beam.lua' to work properly at the word level?
Current sampling output is concatenated:
Thequickbrownfoxjumpedoverthelazydog.
Instead of
The quick brown fox jumped over the lazy dog.
I'm new to lua/torch and would appreciate any suggestions to modify the code.
This link provides a python solution as a possible reference:
http://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words
Cheers,
(This is a followup to my earlier comment on "The Unreasonable Effectiveness of Recurrent Neural Networks".)
Once a
char-rnn
is trained, the user wants to use the RNN to generate funny and plausible-sounding text by sampling from it.Currently,
sample.lua
does generating in a greedy temperature-based fashion, selecting one character at a time with occasional random picks of lower probability characters. This works fairly well but the local/greedy approach can yield suboptimal global picks - each individual character is sensible, but the overall paragraph is especially nonsensical. Sometimes sampling can get trapped in local minima as well. (I believe that this is what happened when I tried outchar-rnn
on generating CSS and sampling would become 'stuck' on data URIs, where it could only keep emitting more random base-64 characters because it was too hard to reach the end-of-URI delimiter, leading to thousands of characters of very low quality and low probability.)A better sampling strategy would be beam search. Such a strategy might look like: at each character, the most probable b next characters are sampled; to quote one description of how this applies to RNN sampling:
Beam search has been applied to RNN work before and generally yields better results more fully exploiting what the RNNs have learned:
The downside is (aside from the work of coding an implementation) that for a large RNN which already takes something like 1s/character, beam search will slow it down even further.
On the other hand, I don't think anyone is seriously trying to apply
char-rnn
to anything which needs high performance in generating text and for the most part users would be happy to have better sampled text at the cost of some more time when generating but no need for additional text to train on or GPU training or hyperparameter search or architectural innovation. I would suggest that it be enabled by default as well so users don't need to evaluate the tradeoff themselves; a beam of 5 seems like, from the earlier papers, adequate for better quality.