tensorflow / hub

A library for transfer learning by reusing parts of TensorFlow models.
https://tensorflow.org/hub
Apache License 2.0

Extracting the tokenizer from Multilingual Universal Sentence Encoder #662

Closed mariokostelac closed 3 years ago

mariokostelac commented 4 years ago

I am evaluating the MUSE model and it performs suboptimally when sentences contain technical acronyms (like LDAP). I assume that's because these tokens are OOV, but I'd like to confirm it by extracting the tokeniser and seeing what exactly the NN gets as input.

Any pointers are very welcome. Also, if this is not the right place and you know a better one, that help is appreciated too 😊 .

jaxlaw commented 4 years ago

MUSE uses tensorflow_text's SentencepieceTokenizer. It is not possible to extract it from the TF2.0 saved_model object returned by hub.load(), because the TF2.0 object does not expose the underlying graph. However, you may be able to download the model as a saved_model with the download button on https://tfhub.dev/google/universal-sentence-encoder-multilingual/3 and open it as a SavedModel proto. Within it you will find the GraphDef in meta_graphs[0].graph_def. Then you can find the "SentencepieceOp" within the list of NodeDef in graph_def.node and obtain a string from node.attr['model'].s that is a serialized Sentencepiece proto. From there you may be able to construct a SentencepieceTokenizer. Exactly how to implement this is left to the reader as an exercise.
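
If you do get that far, here is a minimal sketch of the last step, assuming model_bytes holds the serialized proto string extracted from node.attr['model'].s (the standalone sentencepiece package can load the same bytes, independently of tensorflow_text):

# A sketch, not official API guidance: model_bytes is assumed to be the
# serialized Sentencepiece proto taken from node.attr['model'].s.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.LoadFromSerializedProto(model_bytes)

# Inspect what the network actually sees for a problematic sentence.
print(sp.EncodeAsPieces("configure LDAP for single sign-on"))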

mariokostelac commented 4 years ago

@jaxlaw thanks for these instructions; at least I can be sure that it's not doable the way I was trying, and I have some pointers on where to start 🙇 .

rmothukuru commented 4 years ago

@mariokostelac, can you please confirm whether the issue is resolved so that we can close it? Thanks!

BookChan commented 3 years ago

"SentencepieceOp" @jaxlaw It can't find SentencepieceOp.

Sample Code

import importlib

MODEL_PATH = "./universal-sentence-encoder-multilingual-large_3/"
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
graph = saved_model.meta_graphs[0].graph_def
for node in graph.node:
    if "sentencepieceop" in node.name.lower():
        print(node.name)  # it never reaches this line
        break

arnoegw commented 3 years ago

It's the op name, not the node name.

BookChan commented 3 years ago

> It's the op name, not the node name.

@arnoegw Actually, I get the same result. Can you show your code for getting the sentencepiece op?

import importlib

MODEL_PATH = "./universal-sentence-encoder-multilingual-large_3/"
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
graph = saved_model.meta_graphs[0].graph_def
for node in graph.node:
    if "sentencepieceop" in node.op.lower():
        print(node.name, node.op)  # it never reaches this line
        break

mariokostelac commented 3 years ago

@rmothukuru I haven't had a chance to work on this since, but the pointers are good enough to unblock me. I think we can consider this closed, at least for now.

mariokostelac commented 3 years ago

I've managed to do that in TF1. Here's the code in case somebody needs it:

# TF1
import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece

# Set up graph.
g = tf.Graph()
with g.as_default():
  text_input = tf.placeholder(dtype=tf.string, shape=[None])
  #multiling_embed = hub.Module('./universal-sentence-encoder-multilingual_1/')
  multiling_embed = hub.Module('https://tfhub.dev/google/universal-sentence-encoder-multilingual/1') 
  embedded_text = multiling_embed(text_input)
  init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

# Initialize session.
session = tf.Session(graph=g)

tokenizer_op = g.get_operation_by_name('module/text_preprocessor/SentencepieceEncodeSparse')

# extract the vocab
embedding_words_op = g.get_operation_by_name('module/Embeddings_words')
embedding_words_tensor = embedding_words_op.values()[0]
dict_words = embedding_words_tensor.eval(session=session)

text = "hey there intercom LDAP AWS LOGIN login"

positional_encoding = tokenizer_op.values()[0].eval(feed_dict = {tokenizer_op.inputs._inputs[0]: [text]}, session=session)
token_ids = tokenizer_op.values()[1].eval(feed_dict = {tokenizer_op.inputs._inputs[0]: [text]}, session=session)
length = tokenizer_op.values()[2].eval(feed_dict = {tokenizer_op.inputs._inputs[0]: [text]}, session=session)
tokens = [dict_words[t] for t in token_ids]
positional_encoding, token_ids, length, tokens

The tokens variable will contain entries like [b'\n\x03<s>', b'\n\x06\xe2\x96\x81hey'].
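
Those entries look like length-prefixed proto strings: a 0x0a tag byte, a one-byte length, then the UTF-8 piece itself. A quick-and-dirty decode sketch, assuming every piece is shorter than 128 bytes so the varint length fits in one byte:

def decode_vocab_entry(raw):
    # raw looks like b'\n\x03<s>': tag byte 0x0a, a one-byte length,
    # then the UTF-8 piece itself.
    assert raw[:1] == b'\n'
    length = raw[1]
    return raw[2:2 + length].decode('utf-8')

print([decode_vocab_entry(t) for t in tokens])  # ['<s>', '▁hey', ...]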

mariokostelac commented 3 years ago

@BookChan thanks for the initial pointers! I've managed to explore the graph and instantiate the SentencePieceTokenizer from the graph.

Here is the hacky code I ended up with for TensorFlow 2:

# TF2
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow_text.python.ops.sentencepiece_tokenizer import SentencepieceTokenizer

import importlib
MODEL_PATH = "./musev3/" # just downloaded and extracted in this dir
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')

saved_model = loader_impl.parse_saved_model(MODEL_PATH)
graph = saved_model.meta_graphs[0].graph_def

# extract functions that contain SentencePiece somewhere in there
fns = [f for f in graph.library.function if "sentencepiecetokenizeop" in str(f).lower()]

assert len(fns) == 1

# find SentencePieceOp (contains the model) in the found function
nodes_with_sp = [n for n in fns[0].node_def if n.op == "SentencepieceOp"]
assert len(nodes_with_sp) == 1
model_initializer_node = nodes_with_sp[0]
model = model_initializer_node.attr['model'].s
# we can pretty much save the model into a file since it does not change 

# instantiate the model
tokenizer = SentencepieceTokenizer(model)
token_ids = tokenizer.tokenize('https://twitter.com/mariokostelac').numpy()
print([tokenizer.id_to_string(token_id).numpy() for token_id in token_ids])

token_ids = tokenizer.tokenize('Want to learn about ML tooling? https://modelpredict.com').numpy()
print([tokenizer.id_to_string(token_id).numpy() for token_id in token_ids])
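
Following the comment above about saving the model, a small sketch of persisting the extracted bytes and rebuilding the tokenizer later (the file name is arbitrary):

# Persist the extracted Sentencepiece model so the SavedModel proto
# does not have to be re-parsed every time.
with open('muse_sentencepiece.model', 'wb') as f:
    f.write(model)

# Later: rebuild the tokenizer straight from the file.
with open('muse_sentencepiece.model', 'rb') as f:
    tokenizer = SentencepieceTokenizer(f.read())
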
dayyass commented 3 years ago

> @BookChan thanks for the initial pointers! I've managed to explore the graph and instantiate the SentencePieceTokenizer from the graph. [...]

I needed to get the tokens that are fed into the model, and found this issue and your tokenizer-extraction implementation useful. Inspired by it, I improved it a little so the bytes are transformed into strings. Here is the implementation: https://gist.github.com/dayyass/d02036838213fab1f8fab4837279f7b9
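
For reference, the core of that improvement is just decoding the byte outputs to UTF-8 strings, roughly along these lines (see the gist for the full version):

token_ids = tokenizer.tokenize('Want to learn about ML tooling?').numpy()
tokens = [tokenizer.id_to_string(token_id).numpy().decode('utf-8') for token_id in token_ids]
print(tokens)  # e.g. ['▁Want', '▁to', '▁learn', ...]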

ChrisBobotsis commented 2 years ago

Any suggestions on how to adapt this for https://tfhub.dev/google/universal-sentence-encoder/4 ?

I've tried to use the code from @dayyass and @mariokostelac, but it doesn't seem to work (I get assertion errors on the length of the list).

dayyass commented 6 months ago

Hi, everyone!

I exported the mUSE model from TF to PyTorch and I want to share it with you!

The model itself is available in HF Models and can be used directly through torch (currently without native transformers support); the conversion code and the work itself are available on GitHub.

To be honest, the work was not the easiest: in the end I rewrote the TF computation graph in PyTorch entirely by hand. I hope this will be useful, especially given the trend toward RAG approaches, where good, strong encoders are needed for end-to-end training and fine-tuning 🙏