keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

is the Sequence to Sequence learning right? #395

Closed EderSantana closed 9 years ago

EderSantana commented 9 years ago

Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no way to pass a mask to the objective function. Won't this bias the cost function?
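For reference, a minimal sketch of the pad-then-mask idea using a more recent Keras API (layer sizes, shapes, and the all-zero data are purely illustrative, not from this thread):

```python
# Sketch only: assumes a modern Keras API where the Masking layer's mask
# propagates through the recurrent layers so zero-padded timesteps are
# ignored by the loss.
import numpy as np
from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense

max_len, input_dim, output_dim = 20, 50, 50  # illustrative sizes

model = Sequential([
    Masking(mask_value=0.0, input_shape=(max_len, input_dim)),  # skip padded steps
    LSTM(128, return_sequences=True),                           # one output per timestep
    TimeDistributed(Dense(output_dim, activation='softmax')),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Zero-padded input and target sequences of shape (batch, max_len, dim)
X = np.zeros((4, max_len, input_dim))
Y = np.zeros((4, max_len, output_dim))
model.fit(X, Y, epochs=1, verbose=0)
```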

wxs commented 8 years ago

@LeavesBreathe excited for that: would be a very useful addition.

NickShahML commented 8 years ago

Hey guys, just as another update: I dabbled more with TensorFlow, and I am getting some pretty good results with it. The lowest perplexity I've gotten has been 31. With Fariz's Seq2Seq I got a perplexity of 45 (and that's with no attention), so there is plenty of optimization still left to do!

charlesollion commented 8 years ago

I'm also working on a generic attention decoder, but I'm waiting for multiple-input support on Graph rather than using a concat; I feel that it's cleaner. I'm organizing a hackathon on attention models this Saturday, so I'll keep you posted. If you guys have any suggestions, they would be appreciated.

NickShahML commented 8 years ago

Good to hear. My attention has shifted more to TensorFlow lately, and I'm trying to implement curriculum learning right now. It's really hard for me to determine whether TensorFlow or Keras is the better platform; I think they each have their strengths.

Fariz's platform is really good as well. Right now, because of attention, I am still getting better results in TensorFlow.

Also, Fariz: I recently had a chat with kkastner, and he mentioned a possible improvement from averaging the hidden states in the decoder. I'm going to test that over the next two weeks (each run takes about 6 days). If I get improved results, I'll report back on this thread, as this is something pretty easy to implement.

viksit commented 8 years ago

@farizrahman4u @LeavesBreathe thanks for a very illuminating discussion here. Would you guys talk a bit about the incoming data for the seq2seq or deep LSTMs that you're using? I'm still a bit unclear about how to model my data in order to do the predictions.

NickShahML commented 8 years ago

@viksit there are a lot of ways you can feed your data into seq2seq. The most basic is one-hotting, where each word is assigned its own index in a one-hot vector. Then you can add what's called an embedding layer, which compresses this sparse vector into a dense representation the network can work with. If you scroll up this thread, @simonhughes22 helped me understand this.
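For reference, a minimal sketch of the index-plus-embedding idea (vocabulary size, dimensions, and the example indices are all illustrative):

```python
# Sketch only: each word is mapped to an integer index, and an Embedding
# layer turns that index into a dense vector the LSTM can consume.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM

vocab_size, embed_dim, max_len = 10000, 128, 5  # illustrative sizes

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embed_dim, input_length=max_len),
    LSTM(256),  # e.g. the encoder half of a seq2seq model
])

# Two sentences already converted to word indices and padded with 0
batch = np.array([[4, 17, 923, 0, 0], [12, 5, 61, 7, 2]])
print(model.predict(batch).shape)  # (2, 256)
```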

viksit commented 8 years ago

Ah, I'm already using custom-trained word vectors. Some thoughts: I'm using a toy dataset here. I did see @simonhughes22's comment above, and rather than "book-ending" my sentences, I'm simply using end-of-sentence markers.

What is your name?
My name is X.
How old are you?
I am Y years old.

I then transform this data into:

X_train,y_train
What is your name?<EOS>,My name is X.<EOS>
How old are you?<EOS>,I am Y years old.<EOS>

An end-to-end example would actually be quite helpful to infer some of the details.

NickShahML commented 8 years ago

@viksit, lately I have been working with TensorFlow. Not saying that Fariz's platform is bad or anything, but TF has just given me a better understanding of what's going on.

As such, I don't mask anything. I simply pad with 0's at the end for input and output.
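For reference, a minimal sketch of that kind of end-padding with Keras's pad_sequences (the index sequences and lengths are illustrative):

```python
# Sketch only: zero-pad variable-length index sequences at the end, as
# described above; no masking is applied.
from keras.preprocessing.sequence import pad_sequences

encoder_inputs = [[4, 17, 923], [12, 5, 61, 7, 2]]   # word indices, varying lengths
decoder_targets = [[8, 3, 44, 1], [9, 2, 1]]         # 1 could be an <EOS> index

X = pad_sequences(encoder_inputs, maxlen=5, padding='post')   # -> shape (2, 5)
Y = pad_sequences(decoder_targets, maxlen=5, padding='post')  # -> shape (2, 5)
```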

Each word is a timestep, so I'm not quite sure what you're asking.

When you say build, do you mean compile? About 10 minutes. To train? About 5 days, depending on data size and compute power. I have two 980 Tis, and it takes me about 5 days for a network with 1 million samples.

I don't have an end-to-end example, but you might want to check out this: http://www.tensorflow.org/tutorials/seq2seq/index.html

farizrahman4u commented 8 years ago

Hi everyone, I have made the following updates to my seq2seq implementation:

www.github.com/farizrahman4u/seq2seq

NickShahML commented 8 years ago

Thanks a lot, Fariz. I'm really swamped right now, but this is nice to have for sure. Really appreciate the help.

jgc128 commented 8 years ago

Cool!

pralav commented 8 years ago

I have been following this thread. I am not exactly doing text generation or machine translation. @farizrahman4u @LeavesBreathe @simonhughes22, your insights were really useful. @LeavesBreathe, I would like to know if TensorFlow gave you good results; I am going to try out @farizrahman4u's model as well, so I just wanted to know if TensorFlow worked better. I would also appreciate a suggestion on whether something like a CNN for feature extraction followed by an LSTM encoder-decoder would work. I am working at the character level, not the word level. Thanks a lot in advance.

NickShahML commented 8 years ago

@pralav, I will say TensorFlow is really good, but it is incredibly slow and hogs a ton of memory as of right now. They have plans to improve, but be prepared to spend a lot of money on graphics cards if you plan on using TensorFlow.

Integrating CNNs with LSTMs is something that could potentially lead to interesting results. You have to be careful how you structure it, though.
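For reference, one possible structure, purely as a sketch with arbitrary sizes, is a Conv1D front end over character embeddings feeding an LSTM:

```python
# Sketch only: character-level CNN feature extractor in front of an LSTM;
# all sizes and the output head are arbitrary, not taken from this thread.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

num_chars, max_chars = 100, 200  # illustrative character vocabulary and length

model = Sequential([
    Embedding(num_chars, 32, input_length=max_chars),  # character embeddings
    Conv1D(64, kernel_size=5, activation='relu'),      # local n-gram-like features
    MaxPooling1D(pool_size=2),                          # shorten the sequence
    LSTM(128),                                          # encoder over CNN features
    Dense(num_chars, activation='softmax'),             # e.g. next-character prediction
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```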

pralav commented 8 years ago

@LeavesBreathe Thanks a lot. I will hopefully report back once I experiment with more models. Thank you.

aliabbasjp commented 8 years ago

@LeavesBreathe So what do you suggest using instead of TensorFlow for NLP tasks to help minimize the memory usage?

NickShahML commented 8 years ago

Well, you can wait until TensorFlow improves its memory problems, buy more GPUs, or simply switch to Theano, which is much faster and doesn't hog memory nearly as much as of right now =/

karakiz commented 8 years ago

Hi @farizrahman4u ,

In this post https://github.com/nicolas-ivanov/debug_seq2seq it is mentioned that your seq2seq implementation is not performing well and perhaps not generating sequences properly. Have you looked into those examples? Were your recent commits, especially the attention modules, meant to fix some of the issues mentioned there? Or have there been any other tests like this that show good results with your implementation, perhaps a benchmark against TensorFlow?

Thanks

louisabraham commented 7 years ago

Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right?

Isn't it possible to just skip the input_length parameter and give only the input_dim? Or use input_shape=(None, input_dim)?
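For reference, later Keras versions do accept input_shape=(None, input_dim), as long as sequence lengths agree within each batch; a minimal sketch (sizes and the all-zero data are illustrative):

```python
# Sketch only: an LSTM with no fixed timestep count; each batch can have a
# different sequence length, provided lengths agree within the batch.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

input_dim, output_dim = 50, 50  # illustrative

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(None, input_dim)),
    TimeDistributed(Dense(output_dim, activation='softmax')),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Batches of different lengths are fine, one length per batch
model.train_on_batch(np.zeros((2, 7, input_dim)), np.zeros((2, 7, output_dim)))
model.train_on_batch(np.zeros((2, 11, input_dim)), np.zeros((2, 11, output_dim)))
```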

devm2024 commented 6 years ago

@NickShahML: don't people now use encoder-decoder architectures for sequence-to-sequence learning?