uclnlp / jack

Jack the Reader
MIT License

Do not store embeddings along with the vocab when pickling SharedResources #202

Closed: dirkweissenborn closed this issue 6 years ago

pminervini commented 7 years ago

I really believe the Vocab should be completely disjoint from the embeddings/word representations - I think that would also fix this issue.

Then the embeddings can be easily stored along with all other parameters in a TF checkpoint, by using tf.train.Saver (I'm doing that and it's really convenient IMO).

riedelcastro commented 7 years ago

I agree with Pasquale. But should this be done first?

JohannesMaxWel commented 7 years ago

I also agree with Pasquale. We were talking with Thomas about re-writing the vocabulary.

dirkweissenborn commented 7 years ago

Sounds good to me. But then we would need to handle embeddings differently, e.g., as an argument of an input module, if needed. However, keep in mind that in my code I use the vocabulary of the embeddings directly as my vocabulary. Please make sure that this functionality still exists.

dirkweissenborn commented 7 years ago

Btw, embeddings are not always part of the TF graph (e.g., for FastQA) so I always need to give them to my model prior to loading/creating my model.

dirkweissenborn commented 7 years ago

@pminervini I also don't think that it is a good idea to always store large embedding matrices in the checkpoints. Consider the case where you use all GloVe embeddings. In TF this would become bigger than 4GB on disk.

pminervini commented 7 years ago

@dirkweissenborn in the embedding matrix you usually only have the embeddings for the tokens in the train+validation+test sets; that's a fraction of GloVe.

@dirkweissenborn I just checked the size of a TF checkpoint for an SNLI model using all such GloVe embeddings, and it's roughly 90MB! That sounds doable.

riedelcastro commented 7 years ago

Just to be clear: are we talking about saving pre-trained embeddings, or saving model embeddings?

dirkweissenborn commented 7 years ago

@pminervini I think this is again model and application specific. Think about a use case where you don't have an official test set (e.g., SQuAD). In this case you want ALL your embeddings, because when your trained model is run on the test set you want as much coverage as possible. So I see that the usual workflow might be different, but we should be very careful with our assumptions.

Again, please think about the fact that models should be useful after training as well, where you do not necessarily know what kind of words could be the input. I have the feeling that we should really keep such a scenario in mind and develop jack accordingly.

dirkweissenborn commented 7 years ago

@riedelcastro we are talking about pre-trained embeddings, at least from my point of view. Embeddings that get fine-tuned have to be saved along with the model, of course. However, most words are not seen during training, so I am always in favor of a hybrid approach where you keep the pre-trained embeddings as they are and add additional dimensions for fine-tuning.
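
For concreteness, a minimal sketch of such a hybrid setup in TF 1.x (the sizes, the names, and the random stand-in for GloVe are made up for illustration; this is not jack's actual code):

import numpy as np
import tensorflow as tf

vocab_size, glove_dim, extra_dim = 10000, 300, 50  # hypothetical sizes

# stand-in for the pre-trained GloVe matrix; in practice loaded outside of TF
pretrained = np.random.randn(vocab_size, glove_dim).astype(np.float32)

# fixed pre-trained part: never updated during training
fixed_embeddings = tf.constant(pretrained, name='fixed_embeddings')
# trainable part: a few extra dimensions per word that get fine-tuned
tuned_embeddings = tf.get_variable('tuned_embeddings', [vocab_size, extra_dim],
                                   initializer=tf.zeros_initializer())

word_ids = tf.placeholder(tf.int32, shape=[None, None], name='word_ids')
embedded = tf.concat([tf.nn.embedding_lookup(fixed_embeddings, word_ids),
                      tf.nn.embedding_lookup(tuned_embeddings, word_ids)], axis=-1)

Only tuned_embeddings would end up in the checkpoint, so checkpoints stay small.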

riedelcastro commented 7 years ago

Yeah, okay. Here is what I think

pminervini commented 7 years ago

Both the model parameters and the (pre-trained or fine-tuned) embeddings can be loaded/saved using tf.train.Saver [1], e.g.

# embedding variable (initializer elided in the original example)
embedding_layer = tf.Variable(..., name='embeddings')

# a Saver that handles only the embedding variable
saver = tf.train.Saver({'embeddings': embedding_layer})

# restore previously saved embeddings from the checkpoint ...
saver.restore(session, "/checkpoints/GloVe.ckpt")
[..]
# ... and write them back after training
save_path = saver.save(session, "/checkpoints/GloVe.ckpt")

So we don't have to deal with two disjoint saving mechanisms.

[1] https://www.tensorflow.org/api_docs/python/tf/train/Saver

dirkweissenborn commented 7 years ago

@pminervini please see my comment above: we should not store large embeddings along with each model; that consumes too much disk space and makes saving slow. Furthermore, I think TF doesn't even support tensors that large, so you will not even be able to put all embeddings into TF without partitioning them. Please trust me on this, I had my fair share of experience with this exact problem in the past.

dirkweissenborn commented 7 years ago

I agree with @riedelcastro , this is what I had in mind as well.

pminervini commented 7 years ago

@dirkweissenborn no, I proposed storing the fine-tuned GloVe embeddings, e.g. in /checkpoints/GloVe.ckpt, and using them across different models (which would be saved separately from the embeddings). Could this work?

dirkweissenborn commented 7 years ago

@pminervini what do you mean by fine-tuned? If you mean pre-trained, then yes we could simply convert them to TF first and load them, but I think TF doesn't support very large tensors or graphs in general, so this might get tricky. For now I would just keep them outside of TF.

pminervini commented 7 years ago

@dirkweissenborn I mean that you can minimise the loss by updating the embedding layer - in that case, you need the (fine-tuned) word embeddings along with the model parameters.

It's done e.g. in some state-of-the-art RTE models: https://arxiv.org/abs/1609.06038 (ESIM in the jtr codebase)

pminervini commented 7 years ago

@dirkweissenborn also, in many KBP models, the "word" (entity and predicate) embeddings are literally the only parameters of the model.

dirkweissenborn commented 7 years ago

@pminervini I understand that, but you still cannot naively put all GloVe embeddings into a TF graph. Just keep the words that are fine-tuned in TF and feed the rest from the outside, if needed. Anyway, I strongly believe that a hybrid approach as explained above is even better than fine-tuning pre-trained embeddings. However, everyone can handle that however they want. Just give the user the option to access pre-trained embeddings outside of TF.

pminervini commented 7 years ago

So, should we delegate this to the InputModules? (And, e.g., do this for SQuAD but not for SNLI, etc.)

riedelcastro commented 7 years ago

I think there is also another related question that determines the right answer here: How are pre-trained embeddings fed to the TF model? Are we passing the relevant vectors into the TF module via feed_dicts? Or are we creating a TF variable that we assign the pre-trained embeddings to? Or a big tf.constant in cases where we don't tune the embeddings further?
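
To make the options concrete, here is a rough sketch of the three mechanisms in TF 1.x (the shapes and names are placeholders for illustration, not anything that exists in jack):

import numpy as np
import tensorflow as tf

# hypothetical stand-in for the pre-trained vectors, loaded outside of TF
pretrained = np.random.randn(5000, 300).astype(np.float32)

# (a) feed_dict: the matrix never lives in the graph and is fed on every run call
emb_ph = tf.placeholder(tf.float32, shape=pretrained.shape, name='embeddings_ph')

# (b) variable assigned from a placeholder: trainable, and the matrix is not
#     baked into the graph definition
emb_var = tf.Variable(tf.zeros(pretrained.shape), name='embeddings_var')
init_embeddings = emb_var.assign(emb_ph)
# session.run(init_embeddings, feed_dict={emb_ph: pretrained})

# (c) tf.constant: simplest when embeddings are never tuned, but the whole matrix
#     becomes part of the serialized graph (which has a ~2GB protobuf limit)
emb_const = tf.constant(pretrained, name='embeddings_const')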

pminervini commented 7 years ago

I use this trick for setting only some rows of the word embeddings matrix (and leaving the others with the native TF initializer): https://github.com/tensorflow/tensorflow/issues/4151#issuecomment-244089247

In short, both the word index and the word embedding are fed via a placeholder, and there is a single operation setting one row of the embedding matrix (given by the index) to the word embedding.

Here's my code:

# placeholders for a single word index and its pre-trained vector
word_idx_ph = tf.placeholder(dtype=tf.int32, name='word_idx')
word_embedding_ph = tf.placeholder(dtype=tf.float32, shape=[None], name='word_embedding')
# op that overwrites one row of the embedding matrix (embedding_layer as above)
assign_word_embedding = embedding_layer[word_idx_ph, :].assign(word_embedding_ph)
[..]
# set rows 5, 10 and 15 to the pre-trained vectors v1, v2, v3
session.run(assign_word_embedding, feed_dict={word_idx_ph: 5, word_embedding_ph: v1})
session.run(assign_word_embedding, feed_dict={word_idx_ph: 10, word_embedding_ph: v2})
session.run(assign_word_embedding, feed_dict={word_idx_ph: 15, word_embedding_ph: v3})

tdmeeste commented 7 years ago

If we want the choice of keeping pre-trained embeddings fixed while training out-of-vocab embeddings, and we want the Vocab object to be completely unaware of pre-trained embeddings, then we need to loop over all symbols while initializing the embedding layer, and @pminervini's trick should work fine. @pminervini did you test whether you can control the trainable property this way?

The alternative, stacking a trainable tensor with out-of-vocab embeddings and a fixed one with pre-trained ones, would require the Vocab to have knowledge of which symbols have pre-trained embeddings. It's fast, but does not completely separate vocab and embeddings.
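
A minimal sketch of that stacking alternative in TF 1.x (the vocab split and names are invented for illustration; ids below num_pretrained are assumed to map to symbols with pre-trained vectors, the rest to out-of-vocab symbols):

import numpy as np
import tensorflow as tf

num_pretrained, num_oov, dim = 5000, 200, 300  # hypothetical vocab split

# stand-in for the pre-trained vectors, loaded outside of TF in practice
pretrained = np.random.randn(num_pretrained, dim).astype(np.float32)

fixed_part = tf.constant(pretrained, name='pretrained_embeddings')  # stays fixed
oov_part = tf.get_variable('oov_embeddings', shape=[num_oov, dim])  # gets trained

# gradients only flow into oov_part; the pre-trained rows are untouched
embedding_matrix = tf.concat([fixed_part, oov_part], axis=0)

word_ids = tf.placeholder(tf.int32, shape=[None, None], name='word_ids')
embedded = tf.nn.embedding_lookup(embedding_matrix, word_ids)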

One of the reasons why the Vocab I used before did have this knowledge (i.e., which symbols have pre-trained vectors) is the case where you get test data with unseen symbols: after training, while encoding the test data, you want to know whether you need to assign the UNK id or fetch an available pre-trained embedding. This problem doesn't occur if you load the Vocab with combined train/dev/test data, as @pminervini objected, but that depends on the applications we have in mind. And you still want to avoid loading all available pre-trained embeddings up front.

pminervini commented 7 years ago

@pminervini did you test whether you can control the trainable property this way?

@tdmeeste yes you can (it's how it is actually done in inferbeddings)

This problem doesn't occur if you load the Vocab with combined train/dev/test data as @pminervini objected, but that depends on the applications we have in mind.

I didn't think about SQuAD (and all the cases where you do not have access to the test set), as @dirkweissenborn mentioned. You were right last Sunday: handling OOV words for which you have a pre-trained embedding should be doable (at least for some models).

Maybe each specific InputModule or ModelModule could take care of this?

dirkweissenborn commented 7 years ago

@riedelcastro this should be up to the programmer/user.

@pminervini @tdmeeste For my use cases I simply need access to the embeddings; I actually do not rely on any vocabulary at all and either feed the embedded words directly or feed an embedding matrix for each batch together with a "local" vocab for that batch. This is a bit weird, but I needed it for my previous project. So any solution that gives me access to pre-trained embeddings in the input module would work for me. I think your use cases are much more involved.

undwie commented 6 years ago

Yikes, #243 is actually a duplicate. I will fix this by separating vocab and embeddings. Hoping we have good test coverage!

undwie commented 6 years ago

Actually, what are positive and negative ids? @pminervini @tdmeeste maybe?