ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Pre-trained word embeddings #5

Open WaelSalloum opened 7 years ago

WaelSalloum commented 7 years ago

Hi Ottokart,

Thank you for making this work available. I have a couple questions:

1) The hidden layer size parameter: as I understand it, this represents the size of a single BiRNN layer. Have you tried stacking layers on top of each other?

2) In the paper you have a system that uses pre-trained word vectors. Can you share the code that uses them in the neural network? I see the model runs on plain text, but I wonder whether you have a script that loads a word2vec binary file (or word vectors in some other format) as input to the NN.

3) Have you tried using an embedding layer as the first layer in the network, before the BiRNN? And do you think it's possible to feed pre-trained (general-purpose) vectors into a network whose first layer is a word embedding layer (which would hopefully learn task-specific embeddings), followed by your BiRNN and attention? Can you share your advice on that?

ottokart commented 7 years ago

Hi!

  1. Actually, that is the size of only one direction of the BiRNN state; the total size of the BiRNN state is 2x that. I have not tried stacking multiple BiRNNs before the attention. I did try making the output RNN (the one that uses the attention) bidirectional, but this did not seem to work very well.
  2. In the paper I used the word vectors from glove.6B.50d.txt. With the latest commits I have now added the code for using these embeddings. To do that you need to give the path to the pretrained embeddings in text format in the data.py script configuration (the PRETRAINED_EMBEDDINGS_PATH constant in the header). I also have a small script for converting a word2vec binary file into a pickled Python dict: https://gist.github.com/ottokart/673d82402ad44e69df85 (a rough sketch of loading text-format embeddings follows at the end of this list).
  3. There already is an embedding layer before the BiRNN layer :) But about using pre-trained vectors as an initialization or replacement for these embeddings... my experience is that when you have little data, pre-trained vectors trained on a lot of data can help noticeably, but if you have a lot of training data (a few hundred million words) the difference tends to vanish. That said, starting the embeddings from a sensible initialization might still help training converge faster. As you saw from the paper, the general-purpose embeddings helped with the punctuation restoration task on the relatively small IWSLT dataset.
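For reference, here is a minimal sketch of loading GloVe-style text embeddings into a {word: vector} dict and using them to initialize an embedding matrix. This is not the actual data.py code; the function names, file name, and vocabulary below are just illustrative placeholders.

```python
import numpy as np

EMBEDDING_DIM = 50  # glove.6B.50d.txt contains 50-dimensional vectors


def load_text_embeddings(path):
    """Read a text file where each line is: word v1 v2 ... vN."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings


def build_embedding_matrix(vocabulary, embeddings, dim=EMBEDDING_DIM):
    """Use pre-trained vectors for known words, random init for the rest."""
    matrix = np.random.uniform(-0.05, 0.05,
                               size=(len(vocabulary), dim)).astype(np.float32)
    for i, word in enumerate(vocabulary):
        if word in embeddings:
            matrix[i] = embeddings[word]
    return matrix


# Hypothetical usage:
# pretrained = load_text_embeddings("glove.6B.50d.txt")
# E = build_embedding_matrix(["the", "of", ",", ".", "<UNK>"], pretrained)
```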
WaelSalloum commented 7 years ago

Thank you Ottokart for the prompt and detailed reply.

aarasteh commented 7 years ago

Thank you Ottokart :)