tensorflow / hub

A library for transfer learning by reusing parts of TensorFlow models.
https://tensorflow.org/hub
Apache License 2.0

Training new corpus for USE #155

Closed truas closed 6 years ago

truas commented 6 years ago

Hello,

Sorry if this is a duplicate by any means, but I've looked into #36, #46 and #110 - the answer is still a little cloudy for me.

Is there any way to train a new model with USE on our own custom data corpus? I'm working with semantic/synonym sets (e.g. synsets from WordNet and BabelNet) and would like to train a synset-corpus-like model with USE and explore its features.

I was able to do this using word2vec and paragraph2vec (doc2vec), and it works well for my experiments.

Regards, T.

arnoegw commented 6 years ago

Hi truas.

To my reading of https://github.com/tensorflow/hub/issues/46, https://github.com/tensorflow/hub/issues/110, and the documentation of https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/2, you can train that module, in the following sense:

If you do

import tensorflow_hub as hub

# trainable=True exposes the module's weights to the optimizer.
embed = hub.Module(
    "https://tfhub.dev/google/universal-sentence-encoder/2",
    trainable=True)
embeddings = embed(batch_of_sentences)  # [batch_size, 512] sentence embeddings

and then build a model on top of embeddings, define a loss, and add an optimizer for that loss, then running the optimizer will update the trained weights of the embed module. This lets you fine-tune the module in the context of the problem you use it for, and hopefully enable some quality gains.
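
For concreteness, here is a minimal sketch of such a setup (the binary classifier head, labels, and hyperparameters below are illustrative assumptions, not part of the module):

import tensorflow as tf
import tensorflow_hub as hub

sentences = tf.placeholder(tf.string, shape=[None])
labels = tf.placeholder(tf.int32, shape=[None])

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2",
                   trainable=True)
embeddings = embed(sentences)  # [batch_size, 512]

# Illustrative task-specific head: a small classifier on top of USE.
logits = tf.layers.dense(embeddings, 2)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
# Minimizing the loss also updates the module's weights, since trainable=True.
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    sess.run(train_op,
             feed_dict={sentences: ["an example sentence", "another one"],
                        labels: [0, 1]})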

Fine-tuning for a problem isn't quite the same as training an embedding afresh on a new corpus: the module keeps its original architecture and vocabulary, and only its existing weights get adjusted for your task.

I hope that has provided some clarification.

truas commented 6 years ago

Hello @arnoegw ,

Thank you for the response. In your example, it's assumed I would be using English word tokens in my batch of sentences, to possibly improve my model, right? In my case, I have a specific token format for the corpus/words (to differentiate their senses), so none of them would match anything in the initial pre-trained model.

However, I do have token embedding models in my format (e.g. word2vec), but I don't think they can be loaded and used in place of the pre-trained USE, since they won't match USE's specification.

Regards, T.

vbardiovskyg commented 6 years ago

Hi @truas,

Unfortunately, yes: the vocabulary is fixed, so as @arnoegw described, this module can be used for fine-tuning, but it is not designed for the kind of full training that would replicate the paper.

As for the second part, even if we considered some really dirty hacks, it doesn't seem likely that we could just load the embeddings, because USE is not a simple lookup table.
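
To make the contrast concrete (a sketch with made-up vectors, not a workaround): a word2vec-style model is essentially a fixed token-to-vector lookup table, while USE pushes the whole raw string through encoder weights, so there is no table into which external word vectors could be dropped.

import numpy as np

# A word2vec-style model is, in effect, a lookup table over a fixed vocabulary
# (vectors below are random placeholders, purely for illustration):
word2vec = {"dog": np.random.rand(300), "barks": np.random.rand(300)}
sentence_vec = np.mean([word2vec[tok] for tok in "dog barks".split()], axis=0)

# USE, by contrast, runs the raw string through its encoder network:
#   embeddings = embed(["the dog barks"])  # one 512-dim vector per sentence
# so its learned weights cannot simply be replaced by an external lookup table.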

truas commented 6 years ago

Hello @vbardiovskyg,

Thank you for the clarification! T.