I am curious about this as well.
I was able to look at the tokens using the TensorFlow debugger and the operation "module_apply_default/text_preprocessor/tokenize/StringSplit:1":
Tensor "module_apply_default/text_preprocessor/tokenize/StringSplit:1:DebugIdentity":
dtype: object
shape: (4,)
array([b'<S>', b'arya', b'stark', b'</S>'], dtype=object)
Tokens fetched from the USE operation "module_apply_default/text_preprocessor/tokenize/StringSplit:1" for the 3-paragraph and 4-paragraph inputs:
['<S>', 'the', 'recent', 'rapid', 'progress', 'of', 'neural', 'networkbased', 'natural', 'language', 'understanding', 'research', 'especially', 'on', 'learning', 'semantic', 'text', 'representations', 'can', 'enable', 'truly', 'novel', 'products', 'such', 'as', 'smart', 'compose', 'and', 'talk', 'to', 'books', 'it', 'can', 'also', 'help', 'improve', 'performance', 'on', 'a', 'variety', 'of', 'natural', 'language', 'tasks', 'which', 'have', 'limited', 'amounts', 'of', 'training', 'data', 'such', 'as', 'building', 'strong', 'text', 'classifiers', 'from', 'as', 'few', 'as', '100', 'labeled', 'examples', 'below', 'we', 'discuss', 'two', 'papers', 'reporting', 'recent', 'progress', 'on', 'semantic', 'representation', 'research', 'at', 'google', 'as', 'well', 'as', 'two', 'new', 'models', 'available', 'for', 'download', 'on', 'tensorflow', 'hub', 'that', 'we', 'hope', 'developers', 'will', 'use', 'to', 'build', 'new', 'and', 'exciting', 'applications', 'in', 'learning', 'semantic', 'textual', 'similarity', 'from', 'conversations', 'we', 'introduce', 'a', 'new', 'way', 'to', 'learn', 'sentence', 'representations', 'for', 'semantic', 'textual', 'similarity', 'the', 'intuition', 'is', 'that', 'sentences', 'are', 'semantically', 'similar', 'if', 'they', 'have', 'a', 'similar', 'distribution', 'of', 'responses', 'for', 'example', 'how', 'old', 'are', 'you', 'and', 'what', 'is', 'your', 'age', 'are', 'both', 'questions', 'about', 'age', 'which', 'can', 'be', 'answered', 'by', 'similar', 'responses', 'such', 'as', 'i', 'am', '20', 'years', 'old', 'in', 'contrast', 'while', 'how', 'are', 'you', 'and', 'how', 'old', 'are', 'you', 'contain', 'almost', 'identical', 'words', 'they', 'have', 'very', 'different', 'meanings', 'and', 'lead', 'to', 'different', 'responses', '</S>']
['<S>', 'the', 'recent', 'rapid', 'progress', 'of', 'neural', 'networkbased', 'natural', 'language', 'understanding', 'research', 'especially', 'on', 'learning', 'semantic', 'text', 'representations', 'can', 'enable', 'truly', 'novel', 'products', 'such', 'as', 'smart', 'compose', 'and', 'talk', 'to', 'books', 'it', 'can', 'also', 'help', 'improve', 'performance', 'on', 'a', 'variety', 'of', 'natural', 'language', 'tasks', 'which', 'have', 'limited', 'amounts', 'of', 'training', 'data', 'such', 'as', 'building', 'strong', 'text', 'classifiers', 'from', 'as', 'few', 'as', '100', 'labeled', 'examples', 'below', 'we', 'discuss', 'two', 'papers', 'reporting', 'recent', 'progress', 'on', 'semantic', 'representation', 'research', 'at', 'google', 'as', 'well', 'as', 'two', 'new', 'models', 'available', 'for', 'download', 'on', 'tensorflow', 'hub', 'that', 'we', 'hope', 'developers', 'will', 'use', 'to', 'build', 'new', 'and', 'exciting', 'applications', 'in', 'learning', 'semantic', 'textual', 'similarity', 'from', 'conversations', 'we', 'introduce', 'a', 'new', 'way', 'to', 'learn', 'sentence', 'representations', 'for', 'semantic', 'textual', 'similarity', 'the', 'intuition', 'is', 'that', 'sentences', 'are', 'semantically', 'similar', 'if', 'they', 'have', 'a', 'similar', 'distribution', 'of', 'responses', 'for', 'example', 'how', 'old', 'are', 'you', 'and', 'what', 'is', 'your', 'age', 'are', 'both', 'questions', 'about', 'age', 'which', 'can', 'be', 'answered', 'by', 'similar', 'responses', 'such', 'as', 'i', 'am', '20', 'years', 'old', 'in', 'contrast', 'while', 'how', 'are', 'you', 'and', 'how', 'old', 'are', 'you', 'contain', 'almost', 'identical', 'words', 'they', 'have', 'very', 'different', 'meanings', 'and', 'lead', 'to', 'different', 'responses', 'in', 'this', 'work', 'we', 'aim', 'to', 'learn', 'semantic', 'similarity', 'by', 'way', 'of', 'a', 'response', 'classification', 'task', 'given', 'a', 'conversational', 'input', 'we', 'wish', 'to', 'classify', 'the', 'correct', 'response', 'from', 'a', 'batch', 'of', 'randomly', 'selected', 'responses', 'but', 'the', 'ultimate', 'goal', 'is', 'to', 'learn', 'a', 'model', 'that', 'can', 'return', 'encodings', 'representing', 'a', 'variety', 'of', 'natural', 'language', 'relationships', 'including', 'similarity', 'and', 'relatedness', 'by', 'adding', 'another', 'prediction', 'task', 'in', 'this', 'case', 'the', 'snli', 'entailment', 'dataset', 'and', 'forcing', 'both', 'through', 'shared', 'encoding', 'layers', 'we', 'get', 'even', 'better', 'performance', 'on', 'similarity', 'measures', 'such', 'as', 'the', 'stsbenchmark', 'a', 'sentence', 'similarity', 'benchmark', 'and', 'cqa', 'task', 'b', 'a', 'questionquestion', 'similarity', 'task', 'this', 'is', 'because', 'logical', 'entailment', 'is', 'quite', 'different', 'from', 'simple', 'equivalence', 'and', 'provides', 'more', 'signal', 'for', 'learning', 'complex', 'semantic', 'representations', '</S>']
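For anyone who wants to reproduce this without the debugger, here is a minimal sketch that fetches the same intermediate tensor by name (TF1-style API; the USE Large v3 module URL is my assumption):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Instantiating and applying the module adds its ops to the default graph,
# so the intermediate tokenization tensor can be fetched by name.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
text = tf.placeholder(tf.string, shape=[None])
embeddings = embed(text)  # applying the module creates the StringSplit ops

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    tokens = sess.run(
        "module_apply_default/text_preprocessor/tokenize/StringSplit:1",
        feed_dict={text: ["Arya Stark"]})
    print(tokens)  # array([b'<S>', b'arya', b'stark', b'</S>'], dtype=object)
```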
Facing the same issue.
Hi, thank you for asking this question! I'm wondering the same thing.
For what it's worth, I tried the example you provided on USE models released in versions 1, 2, and 3. The behavior you note is true for version 3 but not for versions 1 and 2. This makes me wonder how inputs are handled differently between versions.
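For reproducibility, here is a sketch of the check I ran (assuming the "large" variants of the module for versions 1 through 3; `long_text` stands in for any text longer than 128 words):

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

long_text = "..."  # any text longer than 128 words
short_text = " ".join(long_text.split()[:128])  # same first 128 words

for version in (1, 2, 3):
    tf.reset_default_graph()
    embed = hub.Module(
        "https://tfhub.dev/google/universal-sentence-encoder-large/%d" % version)
    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        e_long, e_short = sess.run(embed([long_text, short_text]))
        # Identical vectors mean everything past the first 128 words was ignored.
        print(version, np.allclose(e_long, e_short))
```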
@mvss80 , if you don't mind me asking, did you try splitting the long text into multiple strings, embedding each separately, and combining the embeddings? If so, how did you end up doing this, and how well did it work?
@sukanyamoorthy Could you please share the code you used to output the tokens?
Has there been any news on this? It would be great if someone on the core team could help explain how this and the other USE models handle embeddings for very long sentences.
It is true that the model discards the text after seeing 128 tokens. It's the same as for USE Lite, discussed in the recent issue https://github.com/tensorflow/hub/issues/572.
@arnoegw Is it possible to add this behaviour to the docs?
@arnoegw @gowthamkpr Is this clipping at 128 tokens the same for the base USE/4? If not, is there any clipping that does happen?
I am curious too!
Based on my tests, this does not apply to USE 4.
This is not a question about tf_hub but about the Universal Sentence Encoder. If this is not the right place, let me know the appropriate forum to post this.
I noticed that the transformer model (USE Large 3) yields the same embeddings for two strings if the first 128 words are the same. Does the model discard tokens beyond the first 128? I could not find this information in the paper.
Here is sample code where I get the embeddings for the first 3 and the first 4 paragraphs of the USE announcement blog post (https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html). The embeddings are identical.
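A sketch of that comparison follows; the module URL and TF1-style session API are assumptions, and the paragraph texts are elided here:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")

text_3_paras = "..."  # first 3 paragraphs of the blog post
text_4_paras = "..."  # first 4 paragraphs (same text plus one more paragraph)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    e3, e4 = sess.run(embed([text_3_paras, text_4_paras]))
    print(np.array_equal(e3, e4))  # prints True, i.e. identical embeddings
```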
Running the code above will yield `True`. I wanted to confirm whether there is indeed a 128-token limit and whether it can be changed. If there is a 128-token limit, I have a couple of follow-up questions.
1. Is the only option to split the text into chunks, embed each chunk separately, and combine the embeddings? If the embeddings of two chunks with n and k tokens are `A` and `B`, I believe the embedding for the entire text would be `(sqrt(n)*A + sqrt(k)*B) / sqrt(n+k)`, which then needs to be normalized. Is this correct?
2. What this does not account for is how `A` and `B` were normalized. Is there any way to get the `A` and `B` embeddings before they are normalized to unit vectors?
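For concreteness, the combination described in question 1 would be something like the following sketch; whether the sqrt-length weighting actually matches the model's internal pooling is exactly the open question:

```python
import numpy as np

def combine(A, B, n, k):
    # A, B: unit-normalized embeddings of chunks with n and k tokens.
    # Weight by the square root of the token counts, then re-normalize.
    v = (np.sqrt(n) * A + np.sqrt(k) * B) / np.sqrt(n + k)
    return v / np.linalg.norm(v)
```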