Feature: beam search for improving global quality of new text samples

gwern commented 8 years ago

(This is a followup to my earlier comment on "The Unreasonable Effectiveness of Recurrent Neural Networks".)

Once a char-rnn is trained, the user wants to use the RNN to generate funny and plausible-sounding text by sampling from it.

Currently, sample.lua does generating in a greedy temperature-based fashion, selecting one character at a time with occasional random picks of lower probability characters. This works fairly well but the local/greedy approach can yield suboptimal global picks - each individual character is sensible, but the overall paragraph is especially nonsensical. Sometimes sampling can get trapped in local minima as well. (I believe that this is what happened when I tried out char-rnn on generating CSS and sampling would become 'stuck' on data URIs, where it could only keep emitting more random base-64 characters because it was too hard to reach the end-of-URI delimiter, leading to thousands of characters of very low quality and low probability.)

A better sampling strategy would be beam search. Such a strategy might look like: at each character, the most probable b next characters are sampled; to quote one description of how this applies to RNN sampling:

So, beam search is a form of greedy search that does not give an exact highest probability output sequence, but lets us get some number of candidates b, called the beam size. What we do is instead of computing the most likely first word, we compute the b most likely first words (this set of b most likely candidates is called the "beam"). Then to compute the second word, we condition in turn on each of those b first words, and score all the sequences of length 2 obtained that way (b*n of them if n is the size of the vocabulary), then we take the top b of those. Now we can compute the b*n highest scoring length-3 sequences whose length-2 prefixes are in our previous beam, and take the highest-scoring b of those, etc.

Beam search has been applied to RNN work before and generally yields better results more fully exploiting what the RNNs have learned:

The downside is (aside from the work of coding an implementation) that for a large RNN which already takes something like 1s/character, beam search will slow it down even further.

On the other hand, I don't think anyone is seriously trying to apply char-rnn to anything which needs high performance in generating text and for the most part users would be happy to have better sampled text at the cost of some more time when generating but no need for additional text to train on or GPU training or hyperparameter search or architectural innovation. I would suggest that it be enabled by default as well so users don't need to evaluate the tradeoff themselves; a beam of 5 seems like, from the earlier papers, adequate for better quality.

karpathy commented 8 years ago

I recently implemented beam search for an RNN Language Model in context of Image Captioning in NeuralTalk2 repo (https://github.com/karpathy/neuraltalk2/blob/master/misc/LanguageModel.lua#L166). It does makes things work quite a bit better there. I should try to port the option to char-rnn. Thanks!

gwern commented 8 years ago

If you've implemented it before, then even better! I'll be curious to see if beam search does fix my CSS woes, and if it helps out with another char-rnn thing I've been messing with.

pender commented 8 years ago

I wrote a beam search implementation on top of char-rnn some time ago. See sample-beam.lua here: https://github.com/pender/char-rnn/blob/master/sample-beam.lua . A beam width of 2 on a fully trained model eliminates effectively all of the typos even at temperature of 1; I've set this as the default and recommend it, because with a wider beam, the generated text is noticeably conservative in odd ways. I hope this is useful!

gwern commented 8 years ago

pender: I've given it a shot. It seems to crash a lot at randomly with odd errors:

$ th sample-beam.lua cv/lm_lstm_epoch1.09_0.7607.t7 -temperature 0.4 -length 500 -seed 120 -beam 10 -primetext 'M|BIBLE|'
using CUDA on GPU 0...  
Make sure that your saved checkpoint was also trained with GPU. 
If it was trained with CPU use -gpuid -1 for sampling as well.  
Error: File cv/lm_lstm_epoch1.09_0.7607.t7 does not exist. Are you sure you didn't forget to prepend cv/ ?  
/home/gwern/src/torch/install/bin/luajit: cannot open <cv/lm_lstm_epoch1.09_0.7607.t7> in mode r  at /home/gwern/src/torch/pkg/torch/lib/TH/THDiskFile.c:484
stack traceback:
    [C]: at 0x7f05881e2cb0
    [C]: in function 'DiskFile'
    /home/gwern/src/torch/install/share/lua/5.1/torch/File.lua:303: in function 'load'
    sample-beam.lua:76: in main chunk
    [C]: in function 'dofile'
    .../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x00406670
$ th sample-beam.lua cv/jordan-1-vl0.9638.t7 -temperature 0.5 -length 500 -seed 120 -beam 100 -primetext "The Dragon Reborn "
using CUDA on GPU 0...  
Make sure that your saved checkpoint was also trained with GPU. 
If it was trained with CPU use -gpuid -1 for sampling as well.  
creating an lstm... 
seeding with The Dragon Reborn  
--------------------------  
/home/gwern/src/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /home/gwern/src/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109)
stack traceback:
    [C]: at 0x7f1103550bc0
    [C]: in function 'multinomial'
    sample-beam.lua:242: in main chunk
    [C]: in function 'dofile'
    .../src/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x00406670
The Dragon Reborn 1

Lower temperatures & higher beam widths seem more likely to trigger the multinomial error - beam widths of 50+ hardly ever work (some sort of numerical issue?).

Anyway, when it does work, I see what you mean about being "noticeably conservative". Example:

$ th sample-beam.lua cv/jordanbible-vl0.9763.t7 -temperature 0.7 -length 500 -seed 120 -beam 25 -primetext "BIBLE|"
using CUDA on GPU 0...  
Make sure that your saved checkpoint was also trained with GPU. 
If it was trained with CPU use -gpuid -1 for sampling as well.  
creating an lstm... 
seeding with BIBLE| 
--------------------------  
BIBLE|from the word of the Lord God of Israel. 06:017:017 And the LORD spake unto the children of Israel, saying, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, Thus saith the LORD of hosts, the God of Israel, 
Sample completed in 77.91 seconds.

Different RNN, beam of 11, fixed seed and primetext "The Dragon Reborn ":

The Dragon Reborn ing the Dragon Reborn, and the Dragon Reborn and the Dragon Reborn with the Dragon Reborn, and the Dragon Reborn and the Dragon Reborn in Cairhien. The Dragon Reborn and the Dark One of the Light, and the Dragon Reborn and the Dragon Reborn and the Dragon Reborn and the Dragon Reborn, and the Dragon Reborn and the Great Lord of the Dark, and the Dragon Reborn and the Dragon Reborn

No prime:

Rand and the others. There was nothing to do with the Power, but there was nothing to be done about it. There was nothing to be done about it, and there was nothing to be done about it. There was nothing to be done about it, and there was nothing to be done about it. There was nothing to be done about it, but there was nothing to be done about it. There was nothing to be done about it, but there was nothing to be done about it. There was nothing to be done about it, but there was only one way to

Even an extremely wide beam of 50 doesn't help with this issue:

$ th sample-beam.lua cv/jordan-1-vl0.9638.t7 -temperature 0.5 -length 500 -seed 120 -beam 50 -primetext "The Dragon Reborn "
using CUDA on GPU 0...  
Make sure that your saved checkpoint was also trained with GPU. 
If it was trained with CPU use -gpuid -1 for sampling as well.  
creating an lstm... 
seeding with The Dragon Reborn  
--------------------------  
The Dragon Reborn or no Aes Sedai, the Amyrlin Seat and the Aes Sedai who had been the Amyrlin Seat. The Amyrlin Seat and the Amyrlin Seat of the White Tower, and the Amyrlin Seat of the Hall of the Tower. The Amyrlin Seat and the Amyrlin Seat of the White Tower. The Amyrlin Seat and the Amyrlin Seat of the White Tower, and the Amyrlin Seat, and the Amyrlin Seat of the Hall of the Tower, and the Amyrlin Seat of the Hall of the Tower. The Amyrlin Seat and the Amyrlin Seat of the Hall of the Tower, and the Amyrlin 
Sample completed in 206.62 seconds.

Same prime, but beam of 2:

The Dragon Reborn or no Aes Sedai. The White Tower was a surprising manner, and they were still the ones who had been sent to the Tower of Ghenjei. The Amyrlin Seat had been a Darkfriend, and they were all the world in the Two Rivers, and they were all the Aes Sedai who had been the Dragon Reborn and the Dragon Reborn and the Dragon Reborn. The Shaido were all the way to the city, and they were the only ones who had been the ones who could channel. The men who could channel were still a little stronger than the m

(Kind of odd. I had been expecting b>10 to be a waste of computation, not make results worse.)

A little more systematically looking at beam widths 1-20 using a char-rnn I trained on my website & IRC logs a while back, with the shell command: for b in {1..20}; do th sample-beam.lua cv/gweRRN/lm_lstm_epoch15.36_1.1298.t7 -temperature 1 -length 500 -seed 120 -primetext "A " -beam $b; done; alert. (And one final run with b=100.) Stripping out the boilerplate and collapsing each to a single line:

A never had to... film under / Vogans from Harry Potter": "Those dorcontch Three-Economy aged" good subsidies are estimated to damage dramatic rewards imbalanced with the conjecture or not. What would measure to explain so Adam the company? This religious filter Keyes is to list a few Googlers - Getting transcribed, tiers have reindex picked up essentially the Old One modern woman. If every portly desperate animation yet guy up will spend some per point is lower than the true element of neurons as
A never gets broken into the first place. This leads to the most recent participants in the realm of this project, and the interest of the texts are not the possibilities that they think they can be developed in the original computer scientists. There are plenty of information that can be explained without the problems that could be comprehensible, and that they will increase the interest rate in the first season that contributes to ... ... damage to the computation of their lives, and that their
A comments and the more likely it is that there's no way to do things like this. If you have to deal with this, then you'll have to deal with things like something that could be done and work out what they want to do with their parents. There are people who have their own comments about their work. They're the same things that they were being arrested and they were there and they couldn't be trusted and they are the results of the ... ... financial system and they could have forgotten about their
A combination with the statistics of the studies of the literature on the program and complexity theoretical problems, and therefore there will be something like that in the process of programming. This seems like a better idea than anything else. If you have to assume that there is an explanation for this, then there is no evidence that there is no evidence that people ... ... is the same as they are in the context of their position. There is nothing that matters when they are interested in the
A competence and there are a lot of things that have been completely different, and then there are things that could be difficult for me to read about them. There were a lot of things that could be completely different from them and that there was no such thing as ... 13:14 @Gwern-away ... that there wasn't something I would expect of them to be able to provide them all the way through the context of the problem. They were always trying to figure out what they were saying in the first place, and
A completely disgusted with the activity of the black-markets, and then there was nothing that could have been made better than those that had been interested in the production of the Silk Road 2.0, which had been arrested in the 1990s and then confirmed the complaints that they were allowed to complain about their compensation. There were no possibilities that they were not allowed to sell their ...
... operations to their products. They found that people were allowed to sell their products and
A completely completely different from the past few years, and I have no idea what that means. There are a lot of people who have been able to see them all and they don't even have any of them in the world, and that there is no reason to think that they would be able to provide their own predictions with themselves. If they didn't have the best thing to do, they wouldn't have thought that they were going through their comments on their own projects and that there was no ... ... confidence that th
A completely understandable, and that there was nothing to do with the contents of the internet archive, and there was no way for them to do that. It was something I would have expected from this statement, but that was the only thing that I would have been in the first place when there was something wrong with that. I was thinking that there was something I would expect if there was something wrong with that. I would have thought about things like that, but there was something I would have though
A combination with the majority of computer scientists, and there is no reason to think that there is no such thing as being able to determine whether there is anything to do with the programming language in the first place.' http://www.reddit.com/r/SilkRoad/comments/1pko9y/the_bet_bmr_and_sheep_to_die_in_a_year/ http://www.reddit.com/r/DarkNetMarkets/comments/1rdmv7/what_do_you_think_is_this_for_me/ http://www.reddit.com/r/DarkNetMarkets/comments/1rajsd/what_is_the_most_interesting_one/ http://ww
A never had to be interested in whether there was something wrong with them that they were already interested in the programming language, and that there was something that would have been considered to be the most important thing that they were working with themselves and that they were the same things that they were going through, and there was nothing that would have been one of the most interesting things that had been published in the first place.' ... ... http://www.reddit.com/r/Bitcoin/co
A combination with the most popular characters that have been arrested in the first place, which has been published in the first place in the first place.' http://www.reddit.com/r/SilkRoad/comments/1gxiv7/what_is_the_most_popular_anime/ https://www.reddit.com/r/DarkNetMarkets/comments/35y7i9/complaintwarning_australian_arrest/ https://www.reddit.com/r/DarkNetMarkets/comments/35y7ih/what_is_the_most_interesting_anime/ http://www.reddit.com/r/DarkNetMarkets/comments/1qzxx/something_about_the_silk_ro
A comments and there is no such thing as being able to detect the programming language in the first place.' http://www.reddit.com/r/Nootropics/comments/1nxtyb/california_arrested/ https://www.reddit.com/r/DarkNetMarkets/comments/35u8v4/complaintwarning_anything_is_hard/ http://www.reddit.com/r/DarkNetMarkets/comments/1qzxx/possible_darknet_marketplace/ http://www.reddit.com/r/DarkNetMarkets/comments/1rda42/what_are_you_working_on_silk_road_forums/ http://www.reddit.com/r/DarkNetMarkets/comments/1q
A never actually happened in the first place, and there was a difference between the two men and women in the world, and there was no reason to believe that there was something wrong with them that they would have been more likely to have their children than they were working on themselves. They were the only ones who were the only ones in the world where they were the only ones that they were working with themselves that they were ... ... the only people who were thinking that they would have to
A competence for the first time in the first place, there is no reason to believe that there will be something wrong with the experience of the interpretation that there will be a lot of people who have been able to determine whether there is anything that doesn't mean that there is no reason to believe that the problem is that there is no reason to believe that there is no way to establish the fact that there is no reason to believe that there is no evidence that there is an explanation that the
A wert.com and then there was no way to know whether they were interested in the productivity of the original version of the computational complexity of the programming language, and that there was no reason to believe that there was something that would have been better than the other people who were the only ones who were interested in the programs that were published in the first place. There was nothing wrong with the ... ... statistics of the programming language, and that there was somethin
A comments and then there was no way to know whether they were working with them and what they were talking about whether they were working with them and they would have been able to see that they were working with them and realized that there was no way to know whether they were going through them and that they were the only ones who didn't know what they were talking about whether they were interested in their personal information about whether they were ... 14:39 @Gwern-away ... that they wer
A completely failed to make sure that there was something wrong with the contents of the programming language. There was nothing that happened in the first place, but I didn't realize that there was something I would have been able to find anything which would have been written in the first place.' http://www.reddit.com/r/SilkRoad/comments/1pko9y/the_bet_bmr_and_sheep_to_die_in_a_year/ http://www.reddit.com/r/DarkNetMarkets/comments/1qsvvx/the_marketplace_arrested/ https://www.reddit.com/r/DarkNet
A combination with the internet archive, and then there was something that would have been better than the other personality transactions, and there was something wrong with the rest of the world where there was something wrong with the problem that there was something wrong with that. There was a lot of things that could have been better than anything else. I would have thought that there was no reason to think that there was something wrong with that.' http://www.reddit.com/r/SilkRoad/comments/1
A combination with the most important part of the scientific community in the world. There is no such thing as being able to do anything like that.' http://forum.evageeks.org/viewtopic.php?p=40149#msg1009099 http://www.reddit.com/r/SilkRoad/comments/1gxiv7/what_do_you_think_would_you_think_of_the_complaint/ http://www.reddit.com/r/SilkRoad/comments/1pko9y/the_bet_bmr_and_sheep_to_die_in_a_year/ https://www.reddit.com/r/DarkNetMarkets/comments/35y7h6/complaintwarning_australian_silk_road_drug_s
A ment description of the internet archive and the programming language is one of the most interesting things in the world. This is the only thing that happens in the first place.' http://www.reddit.com/r/DarkNetMarkets/comments/1qzxx/possible_darknet_marketplace/ https://www.reddit.com/r/DarkNetMarkets/comments/35y7h9/what_do_you_think_would_you_think_of_it/ https://www.reddit.com/r/DarkNetMarkets/comments/35y7i9/complaintwarning_australian_arrest/ https://www.reddit.com/r/DarkNetMarkets/comments
A never occurred to me that there was something interesting about the possibility that there was something wrong with the difference between the participants and the probability that the statistically significant difference is the probability that the probability of the statistical significance of the probability of the statistically significant difference between the two standard deviations was associated with statistically significant differences.' http://lesswrong.com/lw/hht/link_soylent_crowdf

I don't see any obvious quality trend other than b=1 is the worst, and it's interesting that hyperlinks only begin to show up with higher beam widths.

Going back to my CSS RNN, using high beam widths triggers the same repetition issue. What about my data URI problem? I tried forcing it to manifest by providing the data URI preamble as a prime text, which worked, since even 10k character is not enough to break out of the URI trap:

$ th sample-beam.lua cv/css.t7 -temperature 1 -length 10000 -seed 120 -primetext "url('data:image/png;base64" -beam 1
...url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAD/AAAAGCqAAAA6AAAAAAAAAAAAAAAAAAAFADwAAAAGSpERAAAMmMYAAIDUAAABCAAEAADSeCXn8NFKCPkaCUYnACgm0AAAABXRFSEAAAC7COCTbCJ++g5w+PAXFO55CUUef/nOwTkMagBU0CUPEAkQRBcSNw4SkJtUwdB1fyaZH6nbTAwc82UozrOgNqAHvpJe+/7/5Qt4e1DbKr9UmrMlUpoYjPKtonkxbhMSKlAW80PulP/ZPJttWaXbETALECLw/cc2AQv7H2W3CxSIK/+vzQnF6zoAovqMsp6ux0LoNDdlcVq1Wt4gN51JcbplXOVOcNux8dUWnuYn11RQ7XT06dn/Vpq6ziArXwLbQr9kyXyAdgWFhtID3kUFLQhVj7LqJGbcjsDz6bEJzx5sZhwVarOsZ1svWgZyAeVOb/ZhDLDcKxlPlxhEHMOlPbemo/ummFQKGUPaeE6stlTIPXrRK+5uK
...Wl5ruQLwidE6XZTLFjPfX0qhcUy2yDFCJicv1EuIclc4NtEdnBISR7VmEn/AOquTC5Zn5Bj8EOd1kUu8e9r2iQNZJqRH90HzZnzkMAARuAbiCwqe/Pq8/tL99y816Eyj5FccM0BqJoFk6NtdJgb57edln2KavEInVMwffNYqJIyLycbpprdrHbj5mQnqXf/N8i2D2BYYXquAv2m3xuSPz0emR7bXJoeACEJ3wg0CInQUSn5GgqB2x-wc2h4yDMsUyYVJcV6LTda5BKLWdNsEthlJ968UgCfYK832uuUKAGQEGqrkIgEouP6Un5ZBeZjLNtYsoGjOJWMoGrT4vMkugcofDjjMohnAwrWo7NEL
Sample completed in 13.21 seconds.

I did another loop generating 5k characters to see if at any particular beam width it'd break out.
Around b=6 it falls into an even worse trap, where after starting like url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAAAAAAACAAAAAAAAAAAAAA it just repeats 'A' endlessly. (It doesn't fall into it for all beam widths >=6 - eg b=19 looked like regular gibberish - but most of them. Watching generation, there's also a curious speed change: it seems to pause for a very long time after a few initial AAAs and then suddenly emit all the rest of the As up to the character limit.) As of b=31, generating 5k characters takes 526s and I had observed no breakouts up to this point, so I killed the run.

However, this might reflect the idiosyncratic nature of my trained RNNs with potential overfitting etc since I'm not very good at this. If someone wants to try it out on other datasets like Tiny Shakespeare and see if there's similar catastrophic repetition at higher beam searches, that would be interesting. (Could pender's implementation be wrong? I still find it weird that more beam search makes things much worse.)

So to summarize:

your code is a little buggy and needs some work on robustness to not crash routinely on users
b=2-5 seems to generally improve quality
contrary to my hopes, no level of beam search solves my data URI CSS problem

pender commented 8 years ago

I think it's recognized theoretically that widening the beam can result in an inferior result, and I seem to recall that the papers that mention beam search in the context of sequence prediction nets generally use a beam width of 2, or another low value. But in this case my intuition is that a sequence prediction net trained to predict the next single token will have stretches when it is very confident about the next token (e.g. when reciting a really long word, or after seeing the first three characters of the string "http://www."), and occasions when it is much less confident (e.g. when starting the first word of a new sentence). The wider your beam, the harder you're selecting for sequences that contain long stretches of predictable characters. So you end up in a loop of a particularly predictable string of text, stitched together in a location where it wasn't confident about what would come next anyway. Whatever minimal probability penalty it takes by looping back to the start of the phrase when it was in a toss-up situation anyway is outweighed by the gains of being so confident about the rest of the phrase. If I've got this right, then the flaw is in the net rather than the search algorithm, and if you searched exhaustively and found the sequence of a particular length with the globally highest cumulative probability, it would look similar.

The first crash that you noted above seems to be the result of a misspecified network name, and I suspect that the second is an issue that occurs when your beam is wider than the number of characters in your net's vocabulary. (Given the poor results of wide beams, I haven't done much testing of beam widths greater than 10, nor any for beams widths greater than 50.) I just pushed a fix to the latter.

ghost commented 8 years ago

Thanks again to pender for the excellent 'sample-beam.lua' which greatly improves sampling results with char-rnn.

I've also been testing this word level version: https://github.com/larspars/word-rnn

Is it possible to modify 'sample-beam.lua' to work properly at the word level?

Current sampling output is concatenated:

Thequickbrownfoxjumpedoverthelazydog.

Instead of

The quick brown fox jumped over the lazy dog.

I'm new to lua/torch and would appreciate any suggestions to modify the code.

This link provides a python solution as a possible reference:

http://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words

Cheers,

karpathy / char-rnn

Feature: beam search for improving global quality of new text samples #138