Closed dirkweissenborn closed 6 years ago
I agree with Pasquale. But should this be done first?
I also agree with Pasquale. We were talking with Thomas about re-writing the vocabulary.
Sounds good to me. But then we would need to handle embeddings differently, e.g., as argument of an input module, if needed. However, keep in mind, that in my code I use the vocabulary of the embeddings directly as my vocabulary. Please make sure that this functionality still exists.
Btw, embeddings are not always part of the TF graph (e.g., for FastQA) so I always need to give them to my model prior to loading/creating my model.
@pminervini I also don't think that it is a good idea to store large embedding matrices all the time in the checkpoints. Consider the case where you use all glove embeddings. In TF this would become bigger than 4GB on disk.
@dirkweissenborn in the embedding matrix you usually only have the embeddings for the tokens in the train+validation+test sets; that's a fraction of GloVe
.
@dirkweissenborn I just checked the size of a TF checkpoint for a SNLI model using all such GloVe embeddings, and it's roughly 90M! It sounds doable.
Just to be clear: are we talking about saving pre-trained embeddings, or saving model embeddings?
@pminervini I think this is again model and application specific. Think about a use case where you don't have an official test set (e.g., SQUAD). In this case you want ALL your embeddings because when your trained model is run on the test set you want as much coverage as possible. So I see that the usual workflow might be different, but we should be very careful with our assumptions.
Again, please think about the fact that models should be useful after training as well, where you do not necessarily know what kind of words could be the input. I have the feeling that we should really keep such a scenario in mind and develop jack accordingly.
@riedelcastro we talk about pre-trained embeddings, at least from my point of view. Embeddings that get finetuned have to be saved along with the model of course. However, most of the words are not seen during training so I am always in favor of a hybrid approach where you keep pre-trained embeddings as they are and add additional dimensions for finetuning.
Yeah, okay. Here is what I think
Vocab
and (Pretrained)Embeddings
should be two distinct classes/functions etc.Both the model parameters and the (pre-trained or fine-tuned) embeddings can be loaded/saved using tf.train.Saver
[1], e.g.
embedding_layer = tf.Variable(..., name='embeddings')
saver = tf.train.Saver({'embeddings': embedding_layer})
saver.restore(session, "/checkpoints/GloVe.ckpt")
[..]
save_path = saver.save(session, "/checkpoints/GloVe.ckpt")
So we don't have to deal with two disjoint saving mechanisms.
[1] https://www.tensorflow.org/api_docs/python/tf/train/Saver
@pminervini please see my comment above, we should not store large embeddings along with each model, that consumes too much disk space and is slow on saving. Furthermore, I think TF doesn't even support tensors that are so large, so you will not even be able to put all embeddings into TF without partitioning them. Please trust me on this, I had my fair share of experience with this exact problem in the past.
I agree with @riedelcastro , this is what I had in mind as well.
@dirkweissenborn no I proposed to store the fine-tuned GloVe embeddings e.g. in /checkpoints/GloVe.ckpt
, and use them across different models (which will be saved disjointly from the embeddings). Could this work ?
@pminervini what do you mean by fine-tuned? If you mean pre-trained, then yes we could simply convert them to TF first and load them, but I think TF doesn't support very large tensors or graphs in general, so this might get tricky. For now I would just keep them outside of TF.
@dirkweissenborn I mean that you can minimise the loss by updating the embedding layer - in that case, you need the (fine-tuned) word embeddings along with the model parameters.
It's done e.g. in some SOA RTE models: https://arxiv.org/abs/1609.06038 (ESIM in the jtr
codebase)
@dirkweissenborn also, in many KBP models, the "word" (entity and predicate) embeddings are literally the only parameters of the model.
@pminervini I understand that, but you still cannot put all GloVe embeddings naively into a TF graph. Just keep the words that are fine-tuned in TF and feed the rest from the outside, if needed. Anyway, I strongly believe that a hybrid approach as explained above is even better than fine-tuning pre-trained embeddings. However, everyone can handle that however he/she wants it. Just give the user the option to access pre-trained embeddings outside of TF.
So - should we delegate this to the InputModule
s ? (And e.g. do this for SQuAD, but not for SNLI etc.)
I think there is also another related question that determines the right answer here: How are pre-trained embeddings fed to the TF model? Are we passing the relevant vectors into the TF module via feed_dicts? Or are we creating a TF variable that we assign the pre-trained embeddings to? Or a big tf.constant in cases where we don't tune the embeddings further?
I use this trick for setting only some rows of the word embeddings matrix (and leaving the others with the native TF initializer): https://github.com/tensorflow/tensorflow/issues/4151#issuecomment-244089247
In short, both the word index and the word embedding are fed via a placeholder, and there is a single operation setting one row of the embedding matrix (given by the index) to the word embedding.
Here's my code:
word_idx_ph = tf.placeholder(dtype=tf.int32, name='word_idx')
word_embedding_ph = tf.placeholder(dtype=tf.float32, shape=[None], name='word_embedding')
assign_word_embedding = embedding_layer[word_idx_ph, :].assign(word_embedding_ph)
[..]
session.run(assign_word_embedding, feed_dict={word_idx_ph: 5, word_embedding_ph: v1})
session.run(assign_word_embedding, feed_dict={word_idx_ph: 10, word_embedding_ph: v2})
session.run(assign_word_embedding, feed_dict={word_idx_ph: 15, word_embedding_ph: v3})
If we want the choice whether to fix pretrained and train out-of-vocab embeddings, and we want the Vocab object to be completely unaware of pre-trained embeddings, then we need to loop over all symbols while initializing the embedding layer, and @pminervini's trick should work fine. @pminervini did you test whether you can control the trainable
property this way?
The alternative, stacking a trainable tensor with out-of-vocab embeddings and a fixed one with pre-trained ones, would require the Vocab to have knowledge of which symbols have pre-trained embeddings. It's fast, but does not completely separate vocab and embeddings.
One of the reasons why the Vocab I used before did have this knowledge (i.e., for which symbols we have pretrained vectors), is for the case where you get test data with unseen symbols: after training, while encoding the test data, you want to know whether you need to assign the UNK id, or fetch an available pre-trained embedding for it. This problem doesn't occur if you load the Vocab with combined train/dev/test data as @pminervini objected, but that depends on the applications we have in mind. And you still want to avoid loading all available pre-trained embeddings up front.
@pminervini did you test whether you can control the trainable property this way?
@tdmeeste yes you can (it's how it is actually done in inferbeddings
)
This problem doesn't occur if you load the Vocab with combined train/dev/test data as @pminervini objected, but that depends on the applications we have in mind.
I didn't think about SQuAD (and all cases where you do not have access to the test set), as @dirkweissenborn mentioned. You were right last sunday, handling OOV words you have the embedding of should be doable (at least for some models).
Maybe each specific InputModule
or ModelModule
could take care of this?
@riedelcastro this should be up to the programmer/user.
@pminervini @tdmeeste So for my use cases I simply need access to embeddings, actually I do not rely on any vocabulary at all and feed the embedded words directly or feed an embedding matrix for each batch and a "local" vocab for the batch. This is a bit weird but I needed this for my previous project. WSo any solution that gives me access to pre-trained embeddings in the input module would work for me. I think your use cases are much more involved.
Yikes, #243 is actually a duplicate. I will fix this by actually separating vocab and embeddings. Hoping we have good test coverage!
Actually, what are positive and negatives ids? @pminervini @tdmeeste maybe?
I really believe the
Vocab
should be completely disjoint from the embeddings/word representations - that would also fix this issue I think.Then the embeddings can be easily stored along with all other parameters in a TF checkpoint, by using
tf.train.Saver
(I'm doing that and it's really convenient IMO).