Hi truas.
To my reading of https://github.com/tensorflow/hub/issues/46, https://github.com/tensorflow/hub/issues/110, and the documentation of https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/2, you can train that module, in the following sense:
If you do
embed = hub.Module(
    "https://tfhub.dev/google/universal-sentence-encoder/2",
    trainable=True)
embeddings = embed(batch_of_sentences)
and then build a model on top of embeddings, define a loss, and add an optimizer for that loss, then running the optimizer will also update the pre-trained weights of the embed module. This lets you fine-tune the module in the context of the problem you use it for, and hopefully gain some quality.
Fine-tuning for a problem isn't quite the same as training an embedding afresh on a new corpus: fine-tuning starts from the module's pre-trained weights and fixed vocabulary and only adjusts them for your task, whereas training afresh would mean re-running the original training procedure on your own data.
I hope that has provided some clarification.
Hello @arnoegw,
Thank you for the response. In your example, it is assumed I would be using English word tokens in my sentence batch, to possibly improve my model, right? In my case, I have a specific token format for the corpus/words (to differentiate their senses), so none of them would "match" anything in the initial pre-trained model.
However, I do have token embedding models in my format (e.g., word2vec), but I don't think they can be loaded and used in place of the pre-trained USE, since they won't have USE's specifications.
Regards, T.
Hi @truas,
unfortunately yes, the vocabulary is fixed, so, as @arnoegw described, this module can be used for fine-tuning, but it is not designed for the full training that would replicate the paper.
As for the second part, even if we considered some really dirty hacks, it doesn't seem likely that we could just load the embeddings, because USE is not a simple lookup table.
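To illustrate the difference, a rough sketch (the file name, shapes, and token ids below are made up for illustration):

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# A plain lookup-table embedding: this is the kind of table you could
# initialize directly from your own word2vec vectors.
my_w2v_matrix = np.load("my_synset_vectors.npy")   # hypothetical [vocab_size, dim] matrix
embedding_table = tf.get_variable(
    "synset_embeddings",
    initializer=tf.constant(my_w2v_matrix, dtype=tf.float32))
token_vectors = tf.nn.embedding_lookup(embedding_table, [3, 17, 42])

# USE, by contrast, is a full encoder network over raw strings with its own
# fixed internal vocabulary; there is no single table inside it that word2vec
# vectors could simply replace.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
sentence_vectors = embed(["a raw sentence goes in, a 512-d vector comes out"])
```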
Hello @vbardiovskyg,
Thank you for the clarification! T.
Hello,
Sorry if this is a duplicate, but I've looked into #36, #46, and #110, and the answer is still a little cloudy to me.
Is there any way to train a new model using USE on our own custom data corpus? I'm working with semantic/synonym sets (e.g., synsets from WordNet and BabelNet) and would like to train a USE-style model on a synset corpus and explore its features.
I was able to do this with word2vec and paragraph2vec (doc2vec), and it works well for my experiments.
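For context, roughly the kind of doc2vec workflow I mean (a gensim sketch; the synset-style tokens and parameters are made up for illustration):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus of sense-annotated tokens.
corpus = [
    TaggedDocument(words=["dog#n#1", "bark#v#1", "loud#a#1"], tags=["doc_0"]),
    TaggedDocument(words=["bank#n#1", "river#n#1", "flow#v#1"], tags=["doc_1"]),
]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)
# Embed an unseen "sentence" of synset tokens.
vector = model.infer_vector(["dog#n#1", "run#v#2"])
```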
Regards, T.