NickShahML opened this issue 8 years ago
Hey there and thanks for the message!
Yes, this non-differentiability in the model presents considerable challenges. Whether you argmax or sample from the probability distributions, you've immediately complicated the training. However, one way I'm considering to skirt this challenge is as follows:
At each step of the seq2seq generator, sample or argmax to produce the next token to feed back into the generator; however, instead of feeding that token to the discriminator, pass the entire softmax distribution to the discriminator at each step. With this architecture, you can then calculate gradients w.r.t. each token in the vocabulary. It should be noted that this is an odd thing to do, though: the discriminator will now be evaluating not a sequence of tokens but a sequence of probability distributions. Furthermore, the discriminator has no mechanism to see which token the generator actually chose from the distribution at each time step while producing this sequence of distributions.
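A minimal numpy sketch of that wiring, with toy sizes — the names (`dists`, `disc_input`, etc.) are just illustrative, not code from this repo:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, seq_len = 5, 3  # toy sizes for illustration

# Generator emits logits at each step; we argmax (or sample) a token to
# feed back into the generator, but pass the full distribution onward.
logits = rng.normal(size=(seq_len, vocab_size))
dists = softmax(logits)                  # shape (seq_len, vocab_size)
fed_back_tokens = dists.argmax(axis=-1)  # what the generator conditions on

# The discriminator sees a sequence of probability vectors, not tokens,
# so its loss stays differentiable w.r.t. the generator's logits.
disc_input = dists
```

Note the mismatch the text describes: `fed_back_tokens` influences the generator's rollout, but `disc_input` carries no record of which token was actually drawn.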
This seems potentially related to your idea of passing the direct embedding.
Would certainly like to Skype and discuss further; my username is the same as my GitHub: liamb315. When would be good for you? I have a lot of availability at the moment since I'm recovering from ACL surgery, so I can make most reasonable PST times work.
Talk to you soon, Liam
On Wed, May 4, 2016 at 8:56 AM, LeavesBreathe notifications@github.com wrote:
Hey Liamb, really appreciate the work you're doing here.
For a long time, I've wanted to apply Adversarial networks to NLP. The main problem is that in a seq2seq generator, you have to use the argmax to predict the next word.
The problem with this is that you can't backprop through the argmax function -- it is non-differentiable. I thought of feeding the discriminator a direct embedding and doing nearest neighbors to predict best char/word.
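To make that non-differentiability concrete: argmax is piecewise constant, so a small nudge to the logits leaves the chosen index unchanged, and the gradient is zero almost everywhere (and undefined at ties). A tiny sketch:

```python
import numpy as np

logits = np.array([0.2, 1.5, -0.3])
eps = 1e-4

# argmax is piecewise constant: perturbing the logits slightly does not
# change the selected index, so no useful gradient flows through it.
base = int(np.argmax(logits))
perturbed = int(np.argmax(logits + eps))  # same index as `base`
```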
I noticed from your commits that you have had some instability. If you want to discuss this further, I would be happy to chat with you on skype. My username is 'leavesbreathe'. This is an exciting idea you have. I'm convinced this is the way to go after studying this paper:
http://arxiv.org/abs/1511.05101
Thanks!
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/liamb315/CharacterGAN/issues/4
William Fedus
I agree with you that the best step is to pass the argmax from the generator to produce the next word/char.
Passing the entire softmax distribution would be very tricky though. Suppose your vocabulary was 40k words (which is a small vocab size).
The discriminator then has to consider 40k inputs per timestep, which is pretty considerable. You could think of it as a 40,000-dimensional word embedding, in a way. Chars would be different, since you would only have a 100- or 150-dimensional vectorization.
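Rough arithmetic on that gap, using the sizes from the discussion above:

```python
# Per-timestep input width the discriminator must handle if it receives
# the full softmax distribution (sizes taken from the thread).
word_vocab = 40_000   # "small" word-level vocabulary
char_vocab = 150      # upper end of a character vocabulary

ratio = word_vocab / char_vocab  # roughly 267x wider input per step
```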
But the problem with chars is simply that humans don't think by letter, they think by phrase or word. There are plenty of people who speak fluent English who can't spell a single word. The point is that when you generate content char by char, you're making the task incredibly difficult.
Added you on skype -- would be happy to talk! Really interesting stuff for sure :+1:
Definitely, character-level certainly is making the task considerably more difficult and I agree that it almost certainly doesn't match any sort of cognitive behavior (for instance, setnences are stlil otfen redabale even if letetrs are jubmled). Additionally, operating at character-level also lengthens the time scale that the RNN architectures need to store information and back-propagate gradients; in English, this is approximately a lengthening factor of ~5.
However, one nice gain is that we don't require a dedicated GPU just to handle a 100k word softmax layer. Also, I think it's quite interesting while operating on inputs near the base-level of any hierarchical structure (characters, pixels, sound-pressures, etc.), to see the extent to which useful hierarchical information may be implicitly learned via training.
I'll be on Skype most of the day, feel free to ping me whenever! Look forward to hearing your thoughts on this.
Hey Liam, I tried resending you a contact request, so I think you should be added now? I've sent you a few messages. Just message me on Skype and we can go from there. Perhaps I'm doing something wrong. Again, it's "leavesbreathe".
I agree that hierarchy is nice, but at the same time, practicality is important. Subword neural nets may be the best way to go as a compromise.
You don't need a dedicated GPU for a regular softmax of 40k words. You can use tricks like a sampled softmax or a hierarchical softmax, which I believe resemble human intuition more closely. They make it much less expensive :+1:
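For reference, a rough numpy sketch of the sampled-softmax idea — score the true word plus a handful of sampled negatives instead of all 40k output rows. All names and sizes here are illustrative, not the actual TF API:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab, n_sampled = 8, 40_000, 16  # toy sizes

W = rng.normal(scale=0.1, size=(vocab, hidden))  # output embedding matrix
b = np.zeros(vocab)                              # output biases
h = rng.normal(size=hidden)                      # RNN state at one step
target = 123                                     # true next-word id

# Sample negatives from the vocab excluding the target (shift trick).
negatives = rng.choice(vocab - 1, size=n_sampled, replace=False)
negatives[negatives >= target] += 1

# Score only target + negatives: 17 dot products instead of 40,000.
candidates = np.concatenate(([target], negatives))
logits = W[candidates] @ h + b[candidates]       # shape (n_sampled + 1,)
logits -= logits.max()
probs = np.exp(logits) / np.exp(logits).sum()
loss = -np.log(probs[0])                         # target sits at index 0
```

At evaluation time you'd still compute the full softmax; the savings are in training, where only the candidate rows of `W` receive gradient.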
Odd, I've already confirmed the request but don't see any messages from you. I suppose you don't see my message back, either? Let me try a voice call, perhaps that might work.
Ah and interesting, I need to look further into sampled softmax and hierarchical softmax!
William Fedus
Reinforcement learning, or specifically, REINFORCE, may be a compelling route forward to deal with non-differentiable operations in the graph. I'll keep you posted as I develop experiments.
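A toy sketch of the REINFORCE (score-function) estimator applied to the discrete sampling step — the reward here is a placeholder standing in for a discriminator score, and the sizes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.zeros(4)          # generator's per-token scores (toy vocab of 4)
probs = softmax(logits)

# One REINFORCE sample: draw a discrete token (the non-differentiable
# step), observe a reward, and estimate the gradient of the expected
# reward as reward * grad log pi(token | logits).
token = rng.choice(4, p=probs)
reward = 1.0                  # placeholder for a discriminator signal

one_hot = np.zeros(4)
one_hot[token] = 1.0
grad_log_pi = one_hot - probs           # d log softmax / d logits
grad_estimate = reward * grad_log_pi    # unbiased estimate of dE[R]/dlogits
```

The estimate is unbiased but high-variance, which is exactly where the large-action-space concern below bites: a single sampled token per step gives a very noisy signal over a 40k vocabulary.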
Hey Will, I think you're right that REINFORCE could potentially work really well. However, the biggest problem might be the size of the action space, as others have noted in generation with REINFORCE.