keras-team / keras

Deep Learning for humans
http://keras.io/

is the Sequence to Sequence learning right? #395

Closed EderSantana closed 8 years ago

EderSantana commented 8 years ago

Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no way to pass a mask to the objective function. Won't this bias the cost function?
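
For anyone landing here later: a minimal sketch of one way to keep padded timesteps out of the loss, using per-timestep sample weights. This uses a newer Keras API (TimeDistributed, sample_weight_mode='temporal') than the version discussed in this thread, and all the shapes and sizes below are just placeholders.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

n_samples, timesteps, features, vocab_size = 32, 10, 50, 200  # placeholder sizes
X = np.zeros((n_samples, timesteps, features))    # padded inputs
Y = np.zeros((n_samples, timesteps, vocab_size))  # one-hot targets per timestep
Y[:, :, 0] = 1.0                                  # dummy targets for the sketch

# 1 where the timestep is real, 0 where it is padding, so padded steps
# contribute nothing to the objective
mask = np.ones((n_samples, timesteps))
mask[:, 7:] = 0  # e.g. the last 3 steps of every sample are padding

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(timesteps, features)))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
model.fit(X, Y, sample_weight=mask, epochs=1, batch_size=16)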

gkeskin07 commented 8 years ago

Thanks a lot guys!

simonhughes22 commented 8 years ago

Note that often they use a dictionary for performance purposes when implementing those embedding layers. But that is equivalent to what I described AFAIK. For any pedantic types :)

NickShahML commented 8 years ago

Hey @simonhughes22 , thanks for the pointers again. I've made some good headway so far.

I wanted to revisit one issue we discussed earlier: What is the best way to format the Y_train so that we can predict words? Which of these ideas do you think is the best?

Idea 1 -- One Hot all Words

I've read in several places that doing a softmax over even 2k terms is a bad idea. You face the curse of dimensionality: the more classes, the harder it gets to predict the right word. So with a vocab of 100k words you would have to do a 100k-way softmax. This seems like the option of last resort.

I've seen papers that do 100k-way softmaxes, but only with 8 Titans or so. Here's an example: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

Idea 2 -- Hierarchical Softmax

That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.

I think this is the best idea, but the hardest to implement because of the amount of custom code required. Some work has been done here: https://github.com/fchollet/keras/issues/438.

I feel that I'm not experienced enough to fully pull this off yet. But maybe in the future I will attempt to do this and submit a PR.

Idea 3 -- Use Regression to find the closest word in Word2Vec

Another idea is to abandon the categorical softmax altogether and simply predict vectors. Obviously the neural net is not going to predict the exact vectors, so you take the vector it does produce for each word and find the closest word in the embedding space. I don't know if word2vec has a built-in function for this, but I imagine you could do it. So for each word, you would predict, let's say, a 32-dimensional vector that describes the word.

I think this is complicated to implement. For each sequence of, let's say, 20 words, you're asking the neural net to produce 32 x 20 = 640 numbers. This seems like a nightmare to me. I guess you would use a linear/tanh activation, an mse objective, and the RMSprop optimizer?
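
For illustration, a minimal sketch (not from the thread) of the "find the closest word" step, assuming a hypothetical embedding_matrix of shape (vocab_size, dim) and an index_to_word list:

import numpy as np

def nearest_word(predicted_vec, embedding_matrix, index_to_word):
    # Normalize rows so a dot product equals cosine similarity
    norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    emb = embedding_matrix / np.maximum(norms, 1e-8)
    vec = predicted_vec / max(np.linalg.norm(predicted_vec), 1e-8)
    sims = emb.dot(vec)  # cosine similarity to every vocabulary word
    return index_to_word[int(np.argmax(sims))]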

Idea 4 -- Use the Clustering Idea Discussed Earlier

Not to bring back bad ideas, but I do think the clustering discussed earlier would work well here, the reason being that you do a softmax over only ~400 terms for 80k words. I've run this to predict individual words (not sequences of words), and it always gets the cluster id and word id right after epoch 1.

Advantage: You only have to softmax over 400 terms. As an added bonus, the word id will be near 0 or 1 (since all of the words in a cluster are ordered by frequency).

Disadvantage: You have to predict two separate integers per word. You also have to one-hot encode, but it's only over 400 numbers.
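
As a rough sketch of how such cluster targets could be built (my assumption, not exactly the ~400-term split described above): order the vocabulary by frequency and cut it into roughly sqrt(V) clusters, so each word becomes a (cluster id, within-cluster id) pair and both softmaxes stay small.

import numpy as np

def build_cluster_targets(vocab_by_frequency):
    """vocab_by_frequency: hypothetical list of words, most frequent first."""
    V = len(vocab_by_frequency)
    cluster_size = int(np.ceil(np.sqrt(V)))  # ~283 for an 80k vocabulary
    word_to_ids = {}
    for rank, word in enumerate(vocab_by_frequency):
        word_to_ids[word] = (rank // cluster_size, rank % cluster_size)
    return word_to_ids  # word -> (cluster_id, within_cluster_id)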

I don't mean to bother too much, but I just wanted to hear your thoughts on these options. It will take me at least a few days to implement/test each idea, so I would rather start with the best one and see what happens. Thanks!

sergeyf commented 8 years ago

Howdy,

I've been doing something like your idea 3 using pre-trained vectors as both inputs and outputs:

from keras.models import Sequential
from keras.layers.core import Masking, Dropout, Dense
from keras.layers.recurrent import GRU
from keras.optimizers import Adam  # RMSprop is another option

word_vector_size = 300  # the dimensionality of the word vectors I already have
max_len = 30            # longest (padded) input sequence, for example
dense_size = 512        # (not actually used below)

model = Sequential()
# Ugly hack: set the input shape on the Masking layer by hand (see note below)
M = Masking(mask_value=0)
M._input_shape = (1, max_len, word_vector_size)
model.add(M)
model.add(GRU(word_vector_size, return_sequences=False))  # encode the whole sequence
model.add(Dropout(0.5))
model.add(Dense(word_vector_size, activation="linear"))   # predict a 300-d word vector
# optimizer = RMSprop(lr=0.001, clipnorm=10)  # another option
optimizer = Adam(lr=0.001, clipnorm=10)  # works for me
model.compile(optimizer=optimizer, loss='mse')

(Sorry for the ugly Masking hack - there is currently a bug such that just doing model.add(Masking(...)) doesn't work without an embedding layer in front of it.)

The inputs have shape: (n_samples, max_len, 300) and the outputs are (n_samples, 300). The vectors are dependency-based pre-trained word vectors from here: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

I don't have concrete results yet, but it does learn, and it is a much smaller output space than the one-hot idea (your idea 1). Before this I tried the one-hot approach with 10k word classes and it was also learning, but VERY slowly (I only have a laptop GPU, a GTX 870m); I found you need really heavy gradient clipping with rmsprop (clipnorm=0.1) or no learning would take place. Also, the output space was much larger: 10k by the mini-batch size, whereas with regression the output space is only 300 by the mini-batch size.

My goals are maybe different from yours - I want smart encodings of sentences such that nearest neighbors give sensible results, and better than TFIDF BoW nearest neighbors. It seems like something in this general RNN direction should work, but I probably need a bigger machine to do the training =)

NickShahML commented 8 years ago

@sergeyf Thanks for the tips!

(Sorry for the ugly Masking hack - there is currently some bug such that just doing model.add(Masking) doesn't work without an embedding layer at the moment.)

Thanks for the mask hack -- I was trying to figure this out this morning!

(as I only have a laptop GPU - GTX 870m)

I gotta tell you man, getting a modern Maxwell GPU is so worth it. I can't even imagine trying to do this on a laptop. You can get decent GPUs on eBay: I regularly see 980 Tis going for $600 and Titan Xs going for $850. Just make sure they weren't used for bitcoin mining. This post helped me a lot: http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/

Also, the output space was much larger -> 10k by the number in the mini-batch, whereas with regression your output space is only 300 by the number in the mini-batch.

Right. This is the whole idea with regression: you only have 300 numbers per word (or however many dimensions you choose to use with word2vec).

Currently my goal is to take a sentence and predict the next sentence that makes somewhat logical sense. I think this is similar to what you're doing? It is similar to translation, but the "translation" is the next sentence that should come.

Also, how are you converting your 300-number outputs to a word (when you do model.predict)? Is there a function in word2vec that does that? (I did a little searching and couldn't find one.)

sergeyf commented 8 years ago

Happy to help!

I am not converting the 300-number output to a word. I just leave it as is. Once training is done, I feed entire sentences into the network, but take the representation that comes out of the RNN (before the Dense layer) and use that as my representation of the sentence. Then I do nearest-neighbor queries in that sentence space. The idea is that I just fed a sequence of words into an RNN, so its final state should be a representation of the sentence. Does that make sense? This is why I was pointing out that we may have different goals.

NickShahML commented 8 years ago

@sergeyf apologies for my misunderstanding. I understand what you're getting at. I guess I'm still kind of stuck, though, on which of the four ideas I mentioned above would be best for next-sentence prediction. =/

I definitely think what you're doing is smart, and could potentially work really well!

sergeyf commented 8 years ago

No worries!

I am not sure why you wouldn't just predict the next sentence as represented by word2vec vectors?

So the input is "Horses run." as represented by [x_horses, x_run] and the output is "They run quickly." as represented by [x_they, x_run, x_quickly]. Why ever convert things into categories instead of just leaving them as vector embeddings?

simonhughes22 commented 8 years ago

@LeavesBreathe instead of predicting one-hot vectors, replace those one-hots with the pre-trained vectors (word2vec or those dependency embeddings). Either way your output will be a matrix; you just shrink one of the two dimensions from the size of your vocabulary to the size of the embedding. If that makes sense.
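
A small sketch of what this swap looks like in practice (my own illustration; token_ids and embedding_matrix are hypothetical names): the targets go from (n_samples, timesteps, vocab_size) one-hots to (n_samples, timesteps, embedding_dim) vectors, obtained by indexing the pre-trained embedding matrix.

import numpy as np

def one_hot_targets(token_ids, vocab_size):
    # token_ids: int array of shape (n_samples, timesteps)
    n, t = token_ids.shape
    Y = np.zeros((n, t, vocab_size))
    for i in range(n):
        Y[i, np.arange(t), token_ids[i]] = 1.0
    return Y  # (n_samples, timesteps, vocab_size)

def embedding_targets(token_ids, embedding_matrix):
    # embedding_matrix: (vocab_size, embedding_dim) pre-trained vectors
    return embedding_matrix[token_ids]  # (n_samples, timesteps, embedding_dim)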

NickShahML commented 8 years ago

@sergeyf @simonhughes22 Alright, I just feel stupid. Both of you are telling me the answer, and I can't understand it.

I completely understand X_train: the 3D matrix of (nb_samples, timesteps, word_vectors). And I can do the same with y_train as well. But when I do model.predict, won't the model have to predict word vectors for each word?

Suppose you use 32 numbers per word (i.e. you set the word2vec size to 32). This means that for each word in the sentence, you must predict 32 numbers, correct? Not only that, you can't use a softmax. And what are the odds that the network is going to predict the exact 32 numbers that correspond to a word? This is why I'm so lost.

simonhughes22 commented 8 years ago

IIRC you stack the one-hots or the embeddings vertically, and then you have one column per output step, up to the maximum sequence length. I may have the rows and cols reversed, but that's the idea.

simonhughes22 commented 8 years ago

Make sure, if predicting embeddings, that you use RMSE as the error. I don't think any of the other error metrics are correct for that. At prediction time (once trained), do a cosine-similarity search for the most similar vectors.

NickShahML commented 8 years ago

At prediction time (once trained) do a cosine sim search on most similar vectors

Thank you! This is what I thought you had to do. So the final network should look like this, correct?

from keras.models import Sequential
from keras.layers.core import Masking, Dropout, Dense, Activation, RepeatVector, TimeDistributedDense
from keras.layers.recurrent import JZS1

maxlen = 20                 # padded sequence length (placeholder)
word2vec_dimension = 32     # size of the pre-trained word vectors
hidden_variables = 256
dropout = 0.5
number_of_decoding_layers = 2

model = Sequential()
M = Masking(mask_value=0)   # same Masking hack as above
M._input_shape = (1, maxlen, word2vec_dimension)
model.add(M)
# Encoder -- note that the input shape does not include the number of samples
model.add(JZS1(hidden_variables, input_shape=(maxlen, word2vec_dimension), return_sequences=False))
model.add(Dropout(dropout))
model.add(Dense(hidden_variables))  # consider adding another Dense here
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))     # repeat the sentence encoding for every output timestep
for z in range(number_of_decoding_layers):
    model.add(JZS1(hidden_variables, return_sequences=True))
    model.add(Dropout(dropout))
# Output one word vector per timestep (so the last dimension is the word-vector size,
# not the vocabulary size), to be matched back to words by cosine similarity
model.add(TimeDistributedDense(word2vec_dimension, activation="linear"))
model.compile(loss='mean_squared_error', optimizer='rmsprop')

In particular: loss = mse (or are you saying I should use rmse here instead?), optimizer = rmsprop or adam, activation = linear (or is there a better option?).

simonhughes22 commented 8 years ago

I think so

simonhughes22 commented 8 years ago

I've also had it work with binary cross-entropy, as that doesn't technically do a softmax, but it's not really meant to be used like that. fchollet recommended I use RMSE, but you could try bce if desired. Outputs need to be in the range 0-1 to use bce, I think, so your word vectors would need to be in that range.
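
A minimal sketch of one way to squash pre-trained vectors into that 0-1 range (an assumption on my part; per-dimension min-max scaling, with embedding_matrix as a hypothetical (vocab_size, dim) array):

import numpy as np

def minmax_scale(embedding_matrix):
    lo = embedding_matrix.min(axis=0, keepdims=True)
    hi = embedding_matrix.max(axis=0, keepdims=True)
    return (embedding_matrix - lo) / (hi - lo + 1e-8)  # each dimension now in [0, 1]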

sergeyf commented 8 years ago

Does RMSE vs MSE make much of a difference? It would change the steepness of the curvature near minima and/or saddle points, but I'm not sure which one is preferred. Probably depends on the problem (as with everything)...


simonhughes22 commented 8 years ago

@sergeyf that's probably an empirical question. RMSE is a popular error metric because the errors in the training data are assumed to be normally distributed (by appeal to the CLT), and under those assumptions RMSE is the 'best' metric to minimize when you have real numbers rather than ordinals - at least in theory.

NickShahML commented 8 years ago

Outputs need to be in the range 0-1 to use bce I think, so you word vectors would need to be in that range.

I think the word2vec vectors are in that range, so I'll try that! Interesting that you can use bce -- I would not have thought of that, but hopefully it will perform better than mse. If I end up trying RMSE, I'll submit a pull request for it.

sergeyf commented 8 years ago

There's a discussion about this happening on r/machinelearning: https://www.reddit.com/r/MachineLearning/comments/3qyn0m/sequence_to_sequence_mapping_via_lstm/

Their claim is that this doesn't work as well as classification!

simonhughes22 commented 8 years ago

@sergeyf this link from that discussion would back that up: https://github.com/yandex/faster-rnnlm. In summary, hierarchical softmax is used for speed, but you are sacrificing some accuracy for that efficiency gain. Doing a softmax over a sizable vocabulary is probably not feasible for a lot of real-world problems. NCE seems to be the way to go for these models, but I am unsure how you'd do that in a sequence learning model.

You 'could' try learning to predict a bag-of-words (BOW) representation instead of a sequence, i.e. a single vector the same length as your vocabulary, with a binary indicator for each word. Then train a second model, a simple language model, to translate this into the most probable word sequence. But you've thrown away any word order in passing the predicted BOW between the two models, so this probably wouldn't work as well, particularly as you can often re-order the words in a sentence and completely alter the meaning. It very much depends on the problem you are solving, though: if it's not a linguistics problem but some other kind of sequence, the ordering may be very easy to determine from a BOW-type output.
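
A rough sketch of the bag-of-words idea (my own, with placeholder sizes, not code from this thread): predict a single multi-hot vector the length of the vocabulary with sigmoid outputs and binary cross-entropy, rather than an ordered sequence.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

vocab_size, timesteps, embedding_dim = 10000, 20, 128  # placeholder sizes

model = Sequential()
model.add(LSTM(256, input_shape=(timesteps, embedding_dim)))  # encode the input sequence
model.add(Dense(vocab_size, activation='sigmoid'))  # independent "is word j present?" outputs
model.compile(loss='binary_crossentropy', optimizer='adam')

# Targets: Y[i, j] = 1 if word j appears anywhere in output sentence i, else 0
X = np.zeros((4, timesteps, embedding_dim))
Y = np.zeros((4, vocab_size))
model.fit(X, Y, epochs=1)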

sergeyf commented 8 years ago

@simonhughes22 thanks for that link - it looks really interesting. My particular goal is to do nearest-neighbor queries for various sentences. The sentences tend to be on the shorter side, so it may indeed not be a big deal to lose the ordering. I've tried some models that seem to suggest that, but nothing conclusive yet.

NickShahML commented 8 years ago

@sergeyf @simonhughes22, this reddit discussion is really good. I'll post the findings I've had so far there.

Also, if you're interested, I submitted a PR for root mean squared error, so it's in Keras now if you want to use it.

Doing a softmax over a sizable vocabulary is probably not feasible for a lot of real-world problems.

I completely agree. Doing a softmax over 100k words seems like the wrong direction, even though that Google seq-to-seq paper did it (they needed 8 Titans, though). This is why I suggested the clustering: so that you could use a softmax (and therefore categorical_crossentropy) and not have to resort to mse or rmse. The reddit discussion seems to be criticizing mse hard.

In the meantime, I've been directly inserting vectors and then using cosine distance to predict the next sentence. Haven't had much luck with it yet, though.

I'll comment back here when I have more info

simonhughes22 commented 8 years ago

@sergeyf this might be of use to you, although given how fast the field moves it's relatively old: http://www.utstat.toronto.edu/~rsalakhu/papers/topics.pdf - Hinton and Salakhutdinov paper

sergeyf commented 8 years ago

Thanks @simonhughes22 - I've seen a number of similar papers that all seem to use additional noise and stacked denoising autoencoders. It would be cool if Keras had an RBM implementation so I could try this out without massive hacking =)

I also found two non-neural-network approaches.

One makes use of pre-trained word vectors and then computes a Word Mover's Distance between sentences: http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf. Python code for it is here: https://github.com/mkusner/wmd

And another that marginalizes out the noise that is added for the denoising autoencoder, yielding a closed-form solution: http://arxiv.org/pdf/1301.6770v1.pdf

NickShahML commented 8 years ago

It would be cool if Keras had an RBM implementation so I could try this out without massive hacking =)

I asked about this about a month ago and @EderSantana said it would be easy to implement, but I'm not sure it would be worth it right now. RNNs might still be better than DBNs?

simonhughes22 commented 8 years ago

@sergeyf thanks, cool, I'll check it out.

sergeyf commented 8 years ago

I'm not an expert, but it seems like they are fundamentally different enough to both be worthwhile?


simonhughes22 commented 8 years ago

RBMs are pretty simple to code up. They're not all that different from an autoencoder, though, which you could already do in Keras as is.

NickShahML commented 8 years ago

Hey guys, just as an update: I used about 250 MB of text to train. Basically, I'm getting results where it repeats the same word 4 or 5 times, then moves on to the next word. This was with words input as 128-dimensional vectors, then doing cosine distance on the output vectors.

This was with rmse, adam, and a linear activation. The good news is that it nails "sentence start" and "sentence end" every time. I found the best results with 2 LSTM encoder layers (hidden = 128) and 3 JZS1 decoder layers (hidden = 256). Also, I tried 2 TimeDistributedDense layers and it made things slightly better.

Following the reddit discussion yesterday, I'm going to try inputs as vectors and outputs as cluster + word ids. It will probably take me a week to set this up properly and test. Will report results back here when I get them!

simonhughes22 commented 8 years ago

@LeavesBreathe you normally need to train it for a long time to get past that. I was never able to get great results, as it was more of an intellectual exercise for me, but it did seem to improve over time, and I didn't leave it running for very long.

NickShahML commented 8 years ago

Good to know. I was training each model configuration for 50 epochs, adjusting the learning rate when loss was rising.

What I want to do is compare:

50 epochs of cosine distance with linear + rmse

to

50 epochs of clustering with softmax + categorical_crossentropy

and see which one performs better, then do 200 epochs on the winner =). With the cosine distance, it takes me about 16 hrs for 50 epochs.

simonhughes22 commented 8 years ago

Crikey. Are you using the GPU?

NickShahML commented 8 years ago

Yeah, a 980 Ti... why, do you think that's too slow? Most of my time is wasted just loading matrices... I'm going to upgrade from 16 GB to 24 GB of RAM in a few days... eventually I'm thinking of going to 64 GB of RAM (but that's like 4 or 5 months out).

simonhughes22 commented 8 years ago

I'm not a hardware guy, but that sounds fast. 16 GB of RAM is good for me. I'm assuming you just have a lot of data. If you haven't already, check that Theano is utilizing the GPU; I had to jump through a few hoops to ensure that.

NickShahML commented 8 years ago

Wait, I'm still confused -- are you saying that training 50 epochs in 16 hrs is too slow or too fast?

If you haven't already, check that Theano is utilizing the GPU, I had to jump through a few hoops to ensure that.

It took me a solid day when I was starting out to make sure the GPU was being used, but I assure you that it is (it gets pretty hot, around 60C).

Keep in mind I'm using 250 MB of text, which translates to approximately 2 million samples. I train with a batch size of 4096... but like I said, most of the time is spent loading matrices (which is why I want more RAM).

simonhughes22 commented 8 years ago

Got it. Yeah, you just have a lot of data. Mine is much faster; 50 epochs doesn't take me too long.

oleole commented 8 years ago

@simonhughes22 I've been following this thread and trying the network structure you shared on some conversational text data, but I couldn't get it to learn anything useful. It looks like the only thing it learns is outputting sentence start and sentence end. The only difference is that I'm using categorical_crossentropy as the loss function. I've added special tokens for sentence start/end, left padding, and masking. My sequence_len = 100 and nb_words = 10000.

BTW, for sentence prediction, is the greedy approach the right thing to do (argmax on the final one-hot output layer for each step in the sequence), or should I sample a word according to the predicted probabilities?

Could you shed some insight on what could go wrong?
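
For reference, a minimal sketch of the two decoding options being asked about (greedy argmax vs. sampling), assuming probs is a (timesteps, vocab_size) array from model.predict and index_to_word is a hypothetical id-to-word list:

import numpy as np

def decode_greedy(probs, index_to_word):
    return [index_to_word[int(np.argmax(p))] for p in probs]

def decode_sample(probs, index_to_word):
    # Draw each word according to its predicted probability instead of taking the max
    return [index_to_word[int(np.random.choice(len(p), p=p / p.sum()))] for p in probs]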

simonhughes22 commented 8 years ago

@oleole I can only speak to what I've tried. I have much shorter sequence lengths (inputs and outputs) and a pretty small vocabulary. @LeavesBreathe - this is also relevant to your question: when I trained it, it would start by predicting the start and stop tokens, as those are the easiest to learn and the most common, then it would move onto the most common words, and then repeat the more common phrases, before starting to predict something more interesting. I never got great results, as I've mentioned a number of times, but it did start to predict sequences. So you may just need to let it train for a really long time. In the academic papers on this subject, the training times are pretty long from what I've read; it's a really hard problem. Also, the longer the sequence, the harder it will be to learn relationships over the length of that sequence.

Is it possible to shorten the sequences somehow (e.g. predict just the first 5 or 10 words of the sequence), or to predict only the next word? Predicting the next word would undoubtedly be easier, and that would then be a language model; you can even feed the prediction back in at the end of the existing input and iterate that way to create the full output sequence, although you'll have to write a little extra code to do that. I can't remember if I mentioned that here or on another issue. I'd try that first for anyone having problems learning anything. I'd also start with a smallish dataset so you can iterate faster and test out the model structure before letting rip on the full dataset. The other thing to try is to predict the output word vectors instead of the one-hot encodings; in that case you would use RMSE or bce as your error metric (see the discussion above).
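
A rough sketch of the "feed the prediction back in" idea above (an illustration only; model, vectorize, index_to_word and the end token are hypothetical): train a next-word model, then repeatedly append the predicted word to the context and predict again.

import numpy as np

def generate_sequence(model, seed_words, vectorize, index_to_word,
                      end_token='<end>', max_steps=30):
    words = list(seed_words)
    for _ in range(max_steps):
        x = vectorize(words)              # shape the current context for the model
        probs = model.predict(x)[0]       # distribution over the vocabulary
        next_word = index_to_word[int(np.argmax(probs))]
        if next_word == end_token:
            break
        words.append(next_word)           # the prediction becomes part of the input
    return words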

oleole commented 8 years ago

@simonhughes22 Thanks a lot for your long reply. I do feel it's learning, just slowly; it probably needs much more time and data to move forward.

I agree that learning the full sequence is a very challenging problem. The "A Neural Conversational Model" paper actually does what you suggested: greedily predicting the next word. I'll definitely give that a try.

The other idea of using word vectors as targets is also very interesting, but the results other people reported on reddit are kind of discouraging. A couple of things I'm still not quite clear on:

  1. Given a sequence input, the output will be n vectors (n = time_step). As @sergeyf mentioned, the first RNN's output can be used as a representation of the input sequence (a vector with size = hidden units). However, for the sequence prediction problem, we would first need a representation of the predicted sequence and use that to search for nearest-neighbor sequences. But there is no vector representation of the predicted sequence (the decoder's output), so how do you get it to work, @LeavesBreathe?
  2. If we use pre-trained vectors for the output, as well as to initialize the embedding layer, will the output vectors stay fixed while the embedding weights keep updating? Also, how do you represent sentence start/end in the output vectors? Just some random fixed vectors?

oleole commented 8 years ago

@sergeyf For your experiment using pre-trained vectors as both inputs and outputs: what are the targets you are trying to predict? It looks like a word vector. Is it the next word given the input sequence?

sergeyf commented 8 years ago

Yes, exactly. I am predicting the next word vector from a sequence of word vectors. It didn't go very well!


oleole commented 8 years ago

Have you tried the skip-thoughts vectors (https://github.com/ryankiros/skip-thoughts)?

NickShahML commented 8 years ago

But there is no vector representation of the predicted sequence (the decoder's output), so how do you get it to work, @LeavesBreathe?

@oleole, I haven't really gotten much to work yet. I too get sentence start and end tokens. However, my biggest problem (with word vectors as the y targets) is that the same words get repeated over and over again.

I don't fully understand your question. Your encoding layer creates a vector representation of your input sequence, and the RepeatVector layer repeats that representation for all timesteps of your y output. I personally like to use two LSTM layers for encoding because I feel they capture more salient features, but that might be because I have a huge dataset (about 2 million sentences).

Apologies if I didn't answer your question.

The skip-thoughts paper is also of interest, and something I want to look into eventually. Right now, as mentioned above, I'm working on the clustered output; it's taking longer than expected to get the matrices right.

@LeavesBreathe - also relevant to your question: When I trained it, it would start predicting the start and stop tokens, as those are the easiest to learn and most common, then it would move onto the most common words, and then repeat the more common phrases, before starting to predict something more interesting.

This is really interesting to me, @simonhughes22. I feel it's a strong sign that you need more training data (I know you have a limited set). Usually, if more epochs keep improving the results, that tells you that more data would improve the model even further.

Imagine you doubled your data: your model would reach the same loss in roughly half the epochs, assuming the data is good. Still, it's really good to know that patience is key here.

simonhughes22 commented 8 years ago

@oleole skip-thoughts looks promising; I've been meaning to try it out. The output varies because you are predicting a different output sequence for each input sequence, so I don't quite follow the 'the output will be fixed' comment. The start and end vectors can just be all 0's (or all 1's).

NickShahML commented 8 years ago

Hey guys,

So as an update, I've tried the word-vector input (16 dimensions) with the cluster output. I've only run it for 3 days, so there are plenty of different hyperparameters left to try. Getting to 50 epochs takes about 2 days, so it's a slow process. However, I can tell you this:

Softmax + cross_entropy is much, much better than linear regression/cosine distance.

This might look lame, but here's some sample output. Notice the word diversity!:

_posthumous respectively acute support bleeding association enough pregabalin gluten but hot into : in some of cocaine provides regular medical herrerasaurus review control the widespread coat upper_airway conclusively after glucose such_as linking the number and have cns may nucleus inflammation include respectively psa , patient trauma than occurs vascular evidence medical pseudounipolar dichotomybetween

But I'm wondering about something a little more novel: changing the design of the sequence-to-sequence model.

Before, we were using a RepeatVector for every timestep, as in the model shown here:

model = Sequential()
# Encoder: collapse the input sequence into a single vector
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dropout(dropout))
model.add(Dense(hidden_variables_encoding))
model.add(Activation('relu'))
model.add(RepeatVector(y_maxlen))  # repeat that vector for every output timestep
for z in range(number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')

But now let's consider not using the RepeatVector. Instead, we simply use a TimeDistributedDense layer to get the right number of timesteps for our target:

model = Sequential()
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(TimeDistributedDense(y_maxlen))  # consider adding another TimeDistributedDense here
for z in range(number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Before trying this for 4 or 5 days, I wanted to get you guys' take on it. Thanks a lot!

sergeyf commented 8 years ago

I think you would have to have the return_sequences=True for the encoding LSTM for this to work (second line).

So, if I understand, the major difference is that instead of feeding the decoder the final encoding at every time step, you would be feeding it intermediate encodings at every time step. It's hard to say whether that would be better or worse. Presumably the final encoding with return_sequences=False has more info in it than the intermediate ones, and by intuition the decoder having access to the final encoding should be better, but who knows what the truth is. RNNs surprise me often. =)


NickShahML commented 8 years ago

I think you would have to have the return_sequences=True for the encoding LSTM for this to work (second line).

My mistake, you're absolutely right. Thanks for saving me the compile error.

Presumably the final encoding with return_sequences = False has more info in it than intermediate ones, and the decoder

Why is this necessarily true? For your encoding layer, couldn't you stack more LSTM/TimeDistributedDense layers (in the 'encoding' portion) to make it just as sophisticated? Forgive me if this is an obvious question.

I just don't like the RepeatVector part of the original model. It seems to me that you're repeating the same vector for each timestep. But wouldn't it be more useful if you were feeding a different vector at each timestep?

After all, you expect to see a different word at each timestep, so wouldn't it make more sense to give a different vector input for each timestep?

sergeyf commented 8 years ago

First, let me say that I don't know what actually works or doesn't - the following is my rationale for what I believe should work better.

Let's say you have a sequence of words to encode: 'the cat is dancing'

In the first encoder-decoder type, you get a single encoding(the cat is dancing) that is repeated at every step of the first decoder layer.

In the second type that you are proposing, you get the following encodings:

encoding(the)
encoding(the cat)
encoding(the cat is)
encoding(the cat is dancing)

But! Presumably encoding(the cat is dancing) has strictly more encoded info than encoding(the cat is) (or the others). So repeating it provides as much info as possible at every step of the decoding layer. In a way, this lets the decoding layer know about the whole sequence at every step, not just what has been seen up to that point. It should be an advantage. Not sure if it actually is one...

NickShahML commented 8 years ago

@sergeyf what you're saying makes sense. I like that you're providing as much information as possible. I feel kind of foolish for proposing this in the first place.

If my GPU is ever free, I'll try this idea just for fun.

In the meantime, I'm going to try reversing the input and see if I get better results (like the Google seq-to-seq paper did). I'm also going to try adding more TimeDistributedDense and LSTM/JZS1 layers. If I get anything better, I'll let you guys know.

sergeyf commented 8 years ago

Please don't feel foolish! Sometimes things that sound reasonable are wrong. And we have no idea if they are unless we just try random stuff :) That's my favorite way to learn.