tensorflow / hub

A library for transfer learning by reusing parts of TensorFlow models.
https://tensorflow.org/hub
Apache License 2.0

Max number of tokens considered by Universal Sentence Encoder Large 3 #244

Closed mvss80 closed 4 years ago

mvss80 commented 5 years ago

This is not a question about tf_hub but about the Universal Sentence Encoder. If this is not the right place, let me know the appropriate forum to post this.

I noticed that the transformer model (USE Large 3) yields the same embeddings for two strings if the first 128 words are the same. Does the model discard tokens beyond the first 128? I could not find this information in the paper.

Here is sample code where I get the embeddings for the first 3 and 4 paragraphs of the USE announcement blog (https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html). The embeddings are identical.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the transformer-based USE Large v3 module referenced in the title.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")

t1 = 'The recent rapid progress of neural network-based natural language understanding research, especially on learning semantic text representations, can enable truly novel products such as Smart Compose and Talk to Books. It can also help improve performance on a variety of natural language tasks which have limited amounts of training data, such as building strong text classifiers from as few as 100 labeled examples. Below, we discuss two papers reporting recent progress on semantic representation research at Google, as well as two new models available for download on TensorFlow Hub that we hope developers will use to build new and exciting applications. In “Learning Semantic Textual Similarity from Conversations”, we introduce a new way to learn sentence representations for semantic textual similarity. The intuition is that sentences are semantically similar if they have a similar distribution of responses. For example, “How old are you?” and “What is your age?” are both questions about age, which can be answered by similar responses such as “I am 20 years old”. In contrast, while “How are you?” and “How old are you?” contain almost identical words, they have very different meanings and lead to different responses.' 

t2 = 'The recent rapid progress of neural network-based natural language understanding research, especially on learning semantic text representations, can enable truly novel products such as Smart Compose and Talk to Books. It can also help improve performance on a variety of natural language tasks which have limited amounts of training data, such as building strong text classifiers from as few as 100 labeled examples. Below, we discuss two papers reporting recent progress on semantic representation research at Google, as well as two new models available for download on TensorFlow Hub that we hope developers will use to build new and exciting applications. In “Learning Semantic Textual Similarity from Conversations”, we introduce a new way to learn sentence representations for semantic textual similarity. The intuition is that sentences are semantically similar if they have a similar distribution of responses. For example, “How old are you?” and “What is your age?” are both questions about age, which can be answered by similar responses such as “I am 20 years old”. In contrast, while “How are you?” and “How old are you?” contain almost identical words, they have very different meanings and lead to different responses. In this work, we aim to learn semantic similarity by way of a response classification task: given a conversational input, we wish to classify the correct response from a batch of randomly selected responses. But, the ultimate goal is to learn a model that can return encodings representing a variety of natural language relationships, including similarity and relatedness. By adding another prediction task (In this case, the SNLI entailment dataset) and forcing both through shared encoding layers, we get even better performance on similarity measures such as the STSBenchmark (a sentence similarity benchmark) and CQA task B (a question/question similarity task). This is because logical entailment is quite different from simple equivalence and provides more signal for learning complex semantic representations. ' 

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embeddings = session.run(embed([t1, t2]))

print(np.allclose(embeddings[0, :], embeddings[1, :]))

Running the code above yields True. I wanted to confirm whether there is indeed a 128-token limit and whether it can be changed.

If there is a 128-token limit, I have a couple of follow-up questions.

  1. The arXiv paper mentions that the model uses PTB tokenization, and the ACL paper mentions that all punctuation is removed. Can you share the specifics of the tokenization?
  2. If I need an embedding for text that has more than 128 tokens, I can extract separate embeddings for each part with 128 or fewer tokens and then combine them. The arXiv paper mentions that the sentence embedding is obtained by summing the word embeddings elementwise and dividing by the square root of the number of tokens. For example, if I break the text into two parts with n and k tokens, and get corresponding embeddings A and B, the embedding for the entire text would be (sqrt(n)*A + sqrt(k)*B) / sqrt(n+k), which then needs to be normalized (see the sketch below). What this does not account for is how A and B were normalized. Is there any way to get the A and B embeddings before they are normalized to unit vectors?
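
A minimal sketch of the combination described in (2), assuming A and B are the un-normalized part embeddings (sum of word embeddings divided by sqrt of the token count), which, as noted, is not what the module actually returns; the helper name and the use of NumPy are for illustration only:

import numpy as np

def combine_embeddings(A, B, n, k):
    # A and B are embeddings of two parts with n and k tokens, each assumed to
    # equal the sum of its word embeddings divided by sqrt(number of tokens).
    combined = (np.sqrt(n) * A + np.sqrt(k) * B) / np.sqrt(n + k)
    # Final unit normalization, matching the module's unit-length outputs.
    return combined / np.linalg.norm(combined)
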
dav-ell commented 5 years ago

I am curious about this as well.

sukanyamoorthy commented 5 years ago

I was able to look at the tokens using the TensorFlow debugger (tfdbg) and the operation "module_apply_default/text_preprocessor/tokenize/StringSplit:1".

Tensor "module_apply_default/text_preprocessor/tokenize/StringSplit:1:DebugIdentity":                                                                                              
  dtype: object                                                                                                                                                                    
  shape: (4,)                                                                                                                                                                      

array([b'<S>', b'arya', b'stark', b'</S>'], dtype=object) 

Tokens fetched from the USE operation "module_apply_default/text_preprocessor/tokenize/StringSplit:1":

['<S>', 'the', 'recent', 'rapid', 'progress', 'of', 'neural', 'networkbased', 'natural', 'language', 'understanding', 'research', 'especially', 'on', 'learning', 'semantic', 'text', 'representations', 'can', 'enable', 'truly', 'novel', 'products', 'such', 'as', 'smart', 'compose', 'and', 'talk', 'to', 'books', 'it', 'can', 'also', 'help', 'improve', 'performance', 'on', 'a', 'variety', 'of', 'natural', 'language', 'tasks', 'which', 'have', 'limited', 'amounts', 'of', 'training', 'data', 'such', 'as', 'building', 'strong', 'text', 'classifiers', 'from', 'as', 'few', 'as', '100', 'labeled', 'examples', 'below', 'we', 'discuss', 'two', 'papers', 'reporting', 'recent', 'progress', 'on', 'semantic', 'representation', 'research', 'at', 'google', 'as', 'well', 'as', 'two', 'new', 'models', 'available', 'for', 'download', 'on', 'tensorflow', 'hub', 'that', 'we', 'hope', 'developers', 'will', 'use', 'to', 'build', 'new', 'and', 'exciting', 'applications', 'in', 'learning', 'semantic', 'textual', 'similarity', 'from', 'conversations', 'we', 'introduce', 'a', 'new', 'way', 'to', 'learn', 'sentence', 'representations', 'for', 'semantic', 'textual', 'similarity', 'the', 'intuition', 'is', 'that', 'sentences', 'are', 'semantically', 'similar', 'if', 'they', 'have', 'a', 'similar', 'distribution', 'of', 'responses', 'for', 'example', 'how', 'old', 'are', 'you', 'and', 'what', 'is', 'your', 'age', 'are', 'both', 'questions', 'about', 'age', 'which', 'can', 'be', 'answered', 'by', 'similar', 'responses', 'such', 'as', 'i', 'am', '20', 'years', 'old', 'in', 'contrast', 'while', 'how', 'are', 'you', 'and', 'how', 'old', 'are', 'you', 'contain', 'almost', 'identical', 'words', 'they', 'have', 'very', 'different', 'meanings', 'and', 'lead', 'to', 'different', 'responses', '</S>']

['<S>', 'the', 'recent', 'rapid', 'progress', 'of', 'neural', 'networkbased', 'natural', 'language', 'understanding', 'research', 'especially', 'on', 'learning', 'semantic', 'text', 'representations', 'can', 'enable', 'truly', 'novel', 'products', 'such', 'as', 'smart', 'compose', 'and', 'talk', 'to', 'books', 'it', 'can', 'also', 'help', 'improve', 'performance', 'on', 'a', 'variety', 'of', 'natural', 'language', 'tasks', 'which', 'have', 'limited', 'amounts', 'of', 'training', 'data', 'such', 'as', 'building', 'strong', 'text', 'classifiers', 'from', 'as', 'few', 'as', '100', 'labeled', 'examples', 'below', 'we', 'discuss', 'two', 'papers', 'reporting', 'recent', 'progress', 'on', 'semantic', 'representation', 'research', 'at', 'google', 'as', 'well', 'as', 'two', 'new', 'models', 'available', 'for', 'download', 'on', 'tensorflow', 'hub', 'that', 'we', 'hope', 'developers', 'will', 'use', 'to', 'build', 'new', 'and', 'exciting', 'applications', 'in', 'learning', 'semantic', 'textual', 'similarity', 'from', 'conversations', 'we', 'introduce', 'a', 'new', 'way', 'to', 'learn', 'sentence', 'representations', 'for', 'semantic', 'textual', 'similarity', 'the', 'intuition', 'is', 'that', 'sentences', 'are', 'semantically', 'similar', 'if', 'they', 'have', 'a', 'similar', 'distribution', 'of', 'responses', 'for', 'example', 'how', 'old', 'are', 'you', 'and', 'what', 'is', 'your', 'age', 'are', 'both', 'questions', 'about', 'age', 'which', 'can', 'be', 'answered', 'by', 'similar', 'responses', 'such', 'as', 'i', 'am', '20', 'years', 'old', 'in', 'contrast', 'while', 'how', 'are', 'you', 'and', 'how', 'old', 'are', 'you', 'contain', 'almost', 'identical', 'words', 'they', 'have', 'very', 'different', 'meanings', 'and', 'lead', 'to', 'different', 'responses', 'in', 'this', 'work', 'we', 'aim', 'to', 'learn', 'semantic', 'similarity', 'by', 'way', 'of', 'a', 'response', 'classification', 'task', 'given', 'a', 'conversational', 'input', 'we', 'wish', 'to', 'classify', 'the', 'correct', 'response', 'from', 'a', 'batch', 'of', 'randomly', 'selected', 'responses', 'but', 'the', 'ultimate', 'goal', 'is', 'to', 'learn', 'a', 'model', 'that', 'can', 'return', 'encodings', 'representing', 'a', 'variety', 'of', 'natural', 'language', 'relationships', 'including', 'similarity', 'and', 'relatedness', 'by', 'adding', 'another', 'prediction', 'task', 'in', 'this', 'case', 'the', 'snli', 'entailment', 'dataset', 'and', 'forcing', 'both', 'through', 'shared', 'encoding', 'layers', 'we', 'get', 'even', 'better', 'performance', 'on', 'similarity', 'measures', 'such', 'as', 'the', 'stsbenchmark', 'a', 'sentence', 'similarity', 'benchmark', 'and', 'cqa', 'task', 'b', 'a', 'questionquestion', 'similarity', 'task', 'this', 'is', 'because', 'logical', 'entailment', 'is', 'quite', 'different', 'from', 'simple', 'equivalence', 'and', 'provides', 'more', 'signal', 'for', 'learning', 'complex', 'semantic', 'representations', '</S>']
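
For reference, a rough sketch of fetching that intermediate tensor directly by name rather than through tfdbg (assuming the embed module, t1, and t2 from the original snippet; the op name corresponds to the first application of the module and may differ between module versions):

outputs = embed([t1, t2])  # builds the "module_apply_default" apply graph
tokens_op = tf.get_default_graph().get_tensor_by_name(
    "module_apply_default/text_preprocessor/tokenize/StringSplit:1")

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(session.run(tokens_op))  # the token values actually seen by the encoder
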
ztx0728 commented 5 years ago

I am curious about this as well.

HaniehP commented 5 years ago

Facing the same issue.

csestili commented 5 years ago

Hi, thank you for asking this question! I'm wondering the same thing.

For what it's worth, I tried the example you provided on USE models released in versions 1, 2, and 3. The behavior you note is true for version 3 but not for versions 1 and 2. This makes me wonder how inputs are handled differently between versions.
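
A rough sketch of such a comparison (not my exact code; t1 and t2 are the texts from the original snippet, and the module URLs are the large USE releases on tfhub.dev):

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

for version in (1, 2, 3):
    tf.reset_default_graph()
    embed_v = hub.Module(
        "https://tfhub.dev/google/universal-sentence-encoder-large/%d" % version)
    outputs = embed_v([t1, t2])
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        e = session.run(outputs)
    # True means the extra paragraph did not change the embedding.
    print("version %d identical: %s" % (version, np.allclose(e[0], e[1])))
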

@mvss80 , if you don't mind me asking, did you try splitting the long text into multiple strings, embedding each separately, and combining the embeddings? If so, how did you end up doing this, and how well did it work?

walaam98 commented 4 years ago

@sukanyamoorthy Would you please share the code to output the tokens?

ydennisy commented 4 years ago

Has there been any news on this? It would be great if someone on the core team could help us understand how this and other USE models handle very long inputs.

arnoegw commented 4 years ago

It is true that the model discards the text after the first 128 tokens. It is the same as for USE Lite, discussed in the recent issue https://github.com/tensorflow/hub/issues/572

ydennisy commented 4 years ago

@arnoegw Is it possible to add this behaviour to the docs?

akashsara commented 4 years ago

@arnoegw @gowthamkpr Is this clipping at 128 tokens the same for the base USE/4? If not, does any clipping happen at all?

QtRoS commented 4 years ago

I am curious too!

wenxijuji commented 1 year ago

Based on my tests, this does not apply to USE 4.
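
For anyone who wants to repeat the check, a minimal sketch against USE v4, which is a TF2 SavedModel loaded with hub.load rather than the TF1 session code above (t1 and t2 are the texts from the original snippet):

import numpy as np
import tensorflow_hub as hub

embed_v4 = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
e = embed_v4([t1, t2]).numpy()
# False would indicate the extra paragraph changes the embedding, i.e. no
# truncation to the first 128 tokens for these inputs.
print(np.allclose(e[0], e[1]))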