keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Implementing Seq2Seq Models #5738

Closed Joshua-Chin closed 3 years ago

Joshua-Chin commented 7 years ago

A number of us have shown interest in implementing seq2seq models in Keras #5358, so I'm creating a central issue for it. We can discuss design decisions and prevent duplication of work here.

There are a number of existing pull requests related to seq2seq models:

There is currently a seq2seq example in keras/examples: addition_rnn.py. However, the implementation could be greatly improved. It currently supports only a single layer encoder-decoder architecture. It also does not perform teacher forcing.
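For reference, teacher forcing means conditioning the decoder on the ground-truth previous token during training rather than on its own prediction. A minimal numpy sketch of the corresponding data preparation (the token ids and start/end markers here are made up, not taken from addition_rnn.py):

import numpy as np

# During training the decoder is fed the ground-truth target shifted right
# by one step, instead of its own previous predictions (teacher forcing).
START, EOS = 1, 2                              # hypothetical special tokens
target = np.array([12, 7, 3, 9, EOS])          # hypothetical target token ids

decoder_input = np.concatenate([[START], target[:-1]])  # [ 1 12  7  3  9]
decoder_target = target                                  # [12  7  3  9  2]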

Feel free to comment on any ideas, questions or comments you might have.

@farizrahman4u @fchollet @israelg99.

fchollet commented 7 years ago

I will check out the two PRs after we are done releasing Keras 2. Looking forward to reading everyone's feedback on how to best handle seq2seq models, API-wise. The first PR was a success (symbolic states).

unrealwill commented 7 years ago

Hello,

My 2 cents.

What is the status of TensorFlow's scan (especially regarding multiple outputs and documentation)? Has it caught up with Theano's scan yet?

What I currently do for seq2seq and the like is wrap Theano's scan directly inside my custom layers. I remember trying to use K.rnn, but I quickly hit the wall of not being able to return multiple outputs.

Teacher forcing is a critical part of getting any decent RNN to converge, especially when there are multiple plausible answers to a question during training; without it, training is just noise fitting. The addition_rnn.py example is just teaching beginners to shoot themselves in the foot. Teacher forcing introduces its own problems, one being exposure bias, which can be mitigated by using readout (i.e. feeding the model's own output into the next input) and beam search. The readout should ideally be sampled, which is not done elegantly inside scan, so I needed to pass in uniform noise sampled before the scan and appropriately stop some gradients. The readout branch is an integral part of a model's design and shouldn't just be switched on with a flag.
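To make the sampled-readout point concrete, here is a small numpy sketch of inverse-CDF sampling with uniform noise drawn up front, which is roughly what passing pre-sampled noise into the scan amounts to (an illustration of the idea only, not the actual implementation; in the symbolic version the gradient through the sampled token would be stopped, e.g. with K.stop_gradient):

import numpy as np

rng = np.random.RandomState(0)

def sample_with_noise(probs, u):
    """Inverse-CDF sampling: pick the first index whose cumulative
    probability exceeds the pre-drawn uniform noise u."""
    return int((np.cumsum(probs) > u).argmax())

# Noise for every decoding step is drawn *before* the loop, so the loop
# body itself stays deterministic (scan-friendly).
max_steps = 4
noise = rng.uniform(size=max_steps)

probs = np.array([0.1, 0.6, 0.3])           # softmax output at one step
token = sample_with_noise(probs, noise[0])  # readout token fed to the next step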

Facebook's MIXER goes into more detail on these points and nicely bridges users to the RL paradigm, where they can easily combine more advanced losses (CTC, generative adversarial ones), are not forced to rely on beam search (available in TensorFlow but not in Theano) to generate sequences, and can also benefit from the more recent RL algorithms.

.

The alternative is to unroll everything. The only reason for scan is to handle inputs of dynamic size, but in practice we usually bound them with a max_sequence_length anyway. So if we accept never using dynamic sizes (which can be mitigated by creating new models sharing the same weights on the fly, or pre-compiling them following a 2^max_seq_length pattern), if we accept manually transferring the hidden states from one time step to the next, and if we accept longer compilation times (and perhaps giving up some Theano optimizations; as an aside, if you do no graph optimizations at all, "compilation" is just constructing a topological order of the computation graph, which is exactly the sequence of operations recorded during construction, so its ideal cost is zero, and with a graph executor on the GPU you get dynamic graphs and the whole compilation and debuggability nightmare goes away), then we can happily stay with the usual non-recursive layers, avoid the restrictions on nested scans, and keep finer control for parallel computing.
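A rough sketch of that unrolled approach, using only ordinary feed-forward Keras layers with shared weights and a manually carried hidden state (sizes and layer choices below are arbitrary, for illustration only):

from keras.layers import Input, Dense, Activation, add
from keras.models import Model

MAX_LEN, VOCAB, HIDDEN = 10, 50, 128   # arbitrary sizes for illustration

# Shared layers: the same weights are reused at every unrolled time step.
state_proj = Dense(HIDDEN, use_bias=False)
input_proj = Dense(HIDDEN)
readout = Dense(VOCAB, activation='softmax')

initial_state = Input(shape=(HIDDEN,))
step_inputs = [Input(shape=(VOCAB,)) for _ in range(MAX_LEN)]

h = initial_state
outputs = []
for x_t in step_inputs:
    # Manually transfer the hidden state from one time step to the next.
    h = Activation('tanh')(add([state_proj(h), input_proj(x_t)]))
    outputs.append(readout(h))

model = Model([initial_state] + step_inputs, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')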

.

I see two roads ahead. Let's just take the one less traveled by.

bstriner commented 7 years ago

I'll mention this repo by @farizrahman4u. You shouldn't use normal Keras layers inside symbolic loops, so it tries to provide a framework for building those inner layers. If you always unroll, you can use normal Keras layers. Has anyone actually tested the performance difference when using symbolic loops?

https://github.com/datalogai/recurrentshop

The best solution would be a framework and set of components for recurrent networks. Either recurrentshop or adding something extra to keras-contrib.
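On the performance question, a rough way to measure it is to time the same model with the recurrent layer's unroll flag toggled; absolute numbers will depend heavily on the backend, sequence length and hardware:

import time
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

x = np.random.rand(256, 50, 32).astype('float32')
y = np.random.rand(256, 64).astype('float32')

for unroll in (False, True):
    inp = Input(shape=(50, 32))
    out = LSTM(64, unroll=unroll)(inp)   # unroll=True expands the loop in the graph
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')
    start = time.time()
    model.fit(x, y, epochs=2, batch_size=32, verbose=0)
    print('unroll=%s: %.1fs' % (unroll, time.time() - start))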

chmp commented 7 years ago

Maybe relevant: I sketched (and partially implemented) a way to build general recurrent models reusing as much functionality as possible from keras itself. The idea is that the user specifies a transition model which computes a new state from the current state and the current input. A SimpleRNN-like layer can be implemented as:

from keras.layers import Input, Dense, Activation, add

hidden = Input((128,))
input = Input((10,))

x = Dense(128, activation='relu')(input)
x = add([hidden, x])  # element-wise sum (Keras 2 replacement for Merge(mode='sum'))
new_hidden = Activation('sigmoid')(x)

# RecurrentWrapper is the proposed wrapper, not a built-in Keras layer;
# it yields a layer usable in keras.models.Model or keras.models.Sequential
rnn = RecurrentWrapper(
    input=[input],
    output=[new_hidden],
    bind={hidden: new_hidden},
    return_sequences=True,
)

Here the input and output arguments refer to the current input/output vectors of the sequence. The recurrence is indicated by the bind argument, which specifies that the new_hidden output should be fed back into the hidden input.

To use the input sequences themselves inside the model, corresponding inputs can be passed via the sequence_input argument. For example:

from keras.layers import Input, Dense, Activation, add, GlobalAveragePooling1D

hidden = Input((128,))
input = Input((10,))
sequence = Input((None, 10))

x = Dense(128, activation='relu')(input)

# a function of the full input sequence
x_seq = GlobalAveragePooling1D()(Dense(128)(sequence))

x = add([hidden, x, x_seq])  # element-wise sum (Keras 2 replacement for Merge(mode='sum'))
new_hidden = Activation('sigmoid')(x)

# a layer to be used in keras.models.Model
rnn = RecurrentWrapper(
    input=[input, sequence],
    output=[new_hidden],
    bind={hidden: new_hidden},
    return_sequences=True,
)

My current state can be found here. It's tf only and certainly requires more thorough testing.
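Assuming the RecurrentWrapper from the first example behaves like a standard recurrent layer once constructed (it is not part of Keras itself), usage might look roughly like this:

from keras.layers import Input
from keras.models import Model

seq_in = Input((None, 10))   # a variable-length sequence of 10-dim vectors
seq_out = rnn(seq_in)        # (batch, time, 128) because return_sequences=True
model = Model(seq_in, seq_out)
model.compile(optimizer='adam', loss='mse')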

bstriner commented 7 years ago

Just whipped up a bare-bones S2S model here: https://github.com/bstriner/keras-seq2seq

Heavily undocumented. Interested in everyone's thoughts.

It takes an input sequence and an output sequence, and trains to minimize the categorical cross-entropy of the predicted output sequence. In test mode, each predicted output is fed back into the network.

The interesting part is how it handles sequences of different lengths within a batch. The input and output sequences are concatenated into a single matrix, and a separate mask matrix indicates which positions are input, output, or padding. The mask controls whether the unit is consuming input or producing output at each step, and which values contribute to the loss.
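A hypothetical numpy illustration of that packing scheme (the exact mask encoding in the repo may differ):

import numpy as np

PAD, IN, OUT = 0, 1, 2   # hypothetical mask codes

def pack(input_ids, output_ids, total_len):
    """Concatenate input and output tokens into one padded row, plus a
    parallel mask marking input, output and padding positions."""
    tokens = input_ids + output_ids
    mask = [IN] * len(input_ids) + [OUT] * len(output_ids)
    pad = total_len - len(tokens)
    return (np.array(tokens + [0] * pad),
            np.array(mask + [PAD] * pad))

tokens, mask = pack([5, 3, 9], [7, 2], total_len=8)
# tokens: [5 3 9 7 2 0 0 0]
# mask:   [1 1 1 2 2 0 0 0]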

It trains on Shakespeare bigrams: the input is the character sequence of word 1, and it predicts word 2 character by character. Not the most interesting dataset, but easy enough. Does anyone have a good dialogue dataset or something similar?

cry: [they] denmark: [antonio] winds: [wife] you: [antony] thou: [most]

You can run it with stochastic generation or deterministic argmax. I tested it in Theano, but it should hopefully work on TF as well.
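The two generation modes come down to something like this (illustrative only):

import numpy as np

def next_token(probs, deterministic=True, rng=np.random):
    """Choose the next output token from a softmax distribution."""
    if deterministic:
        return int(np.argmax(probs))               # greedy / argmax decoding
    return int(rng.choice(len(probs), p=probs))    # stochastic sampling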

The big issue with layers like this is that it feels like I'm reimplementing LSTM, Dense, Embedding, etc. each time. Maybe a good place to try using recurrentshop.

Another annoying issue is the need to concatenate and splice tensors, where in raw Theano I would simply use multiple inputs and outputs. It means I have to mix datatypes, which is ugly. Maybe that needs a PR.

Cheers

bstriner commented 7 years ago

Note on keras-seq2seq: I had to write custom backend functions for cumsum and zeros. We really need to add those two to the backend. I want to dynamically create a tensor of zeros, not instantiate a variable. I haven't tested TensorFlow, but it might work.
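For reference, backend-agnostic helpers along those lines might look roughly like the following; the names cumsum and dynamic_zeros are illustrative, not existing Keras backend functions:

from keras import backend as K

if K.backend() == 'theano':
    import theano.tensor as T

    def cumsum(x, axis=0):
        return T.cumsum(x, axis=axis)

    def dynamic_zeros(shape, dtype='float32'):
        # A zeros *tensor* built from a possibly symbolic shape,
        # unlike K.zeros, which instantiates a shared variable.
        return T.zeros(shape, dtype=dtype)
else:  # tensorflow
    import tensorflow as tf

    def cumsum(x, axis=0):
        return tf.cumsum(x, axis=axis)

    def dynamic_zeros(shape, dtype='float32'):
        return tf.zeros(shape, dtype=dtype)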

It would also be nice to put some more random streams into the backend, so you don't have to generate random numbers on the CPU. Currently the random noise is generated in numpy, concatenated to the rest of the input, and passed to the RNN.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

lucasjinreal commented 7 years ago

Does official Keras contain a seq2seq implementation now?

stephenhky commented 7 years ago

Yes, I wonder about the status of seq2seq in Keras too. TensorFlow has tensor2tensor as well, but I am not sure how Theano is doing right now.

NumesSanguis commented 7 years ago

Is this a good implementation of seq2seq?: https://github.com/farizrahman4u/seq2seq

iamyuanchung commented 7 years ago

Does official Keras contain a seq2seq implementation now?

gsoul commented 6 years ago

A blog post on seq2seq by the Keras author: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html. Note that it doesn't appear to include an attention mechanism.

alvations commented 6 years ago

For anyone looking for the direct link to the demo/example in the keras repo, it's https://github.com/fchollet/keras/blob/master/examples/lstm_seq2seq.py
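For context, the core wiring of that example is roughly as follows: an encoder LSTM returns its final states, and a decoder LSTM initialised with them is trained with teacher forcing (simplified; see the linked script for data preparation and inference-time decoding):

from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Placeholder sizes; in the script they are derived from the dataset.
num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256

# Encoder: keep only the final hidden and cell states.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: conditioned on the encoder states, fed the target sequence
# shifted by one step (teacher forcing).
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')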

simra commented 6 years ago

Glad to come across this thread. I have a number of questions about lstm_seq2seq and how to save/restore the decoder. Would you prefer I ask them here (in this issue or a separate one) or over on stackoverflow?