Closed. ghassen1302 closed this issue 3 years ago.
Hello @ghassen1302,
First of all, how have you tried to use TransformerTFEncoder or UniversalSentenceEncoder, and why do you say that the latter is not supported?
TransformerTFEncoder will use the Transformers API to call the TFAutoModelForPreTraining.from_pretrained method: https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModelForPreTraining
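In other words, something like this happens under the hood (a minimal sketch; the model name is just an illustration, not one from this thread):

from transformers import TFAutoModelForPreTraining

# the encoder resolves pretrained_model_name_or_path through the auto class
model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased')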
I hope this helps
For UniversalSentenceEncoder:
the model (LaBSE) isn't supported, as it's not in the universal sentence encoder collection (https://tfhub.dev/google/collections/universal-sentence-encoder/1) specified in the documentation.
For TransformerTFEncoder:
this is my encode.yml:
!TransformerTFEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: <model.ckpt.index path>
  max_length: 100
  from_tf: True
I'm a bit confused about which file is the model.ckpt.index and what the configuration object should be.
I will check the Hugging Face documentation.
Thanks
I have seen the TransformerTFEncoder documentation and it seems misleading, since there is actually no from_tf parameter. The documentation of the params says:
:param pretrained_model_name_or_path: Either:
- a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
- a string with the `identifier name` of a pre-trained model that was user-uploaded to Hugging Face S3, e.g.: ``dbmdz/bert-base-german-cased``.
- a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
- a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument.
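For that last case, usage along the lines of the transformers documentation would be (a sketch; both paths are hypothetical):

from transformers import BertConfig, BertModel

# load the original BERT JSON config, then the TF 1.x checkpoint weights
config = BertConfig.from_json_file('./tf_model/bert_config.json')
model = BertModel.from_pretrained('./tf_model/model.ckpt.index', from_tf=True, config=config)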
So I would say it may expect the root directory of your saved model.
How have you generated this trained model?
No, I wasn't able to. I just have a model that I downloaded from TF Hub and want to use as an encoder.
Can you please share the logs you observe when trying to load the encoder with the configuration you sent earlier?
For <model.ckpt.index path>, should I use the path of variables.data-00000-of-00001, variables.index, or saved_model.pb?
I think you should use the base path of all these files, but I will try to reproduce this tomorrow if you can guide me about where to get this model or a similar one.
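For reference, a standard TensorFlow SavedModel export (which is what TF Hub serves) usually has this layout:

saved_models/1/
├── saved_model.pb                      # serialized graph
├── variables/
│   ├── variables.data-00000-of-00001   # weight values
│   └── variables.index                 # weight index
└── assets/                             # e.g. vocab files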
This is the model: https://tfhub.dev/google/LaBSE/1. These are the logs:
encoder1@29752[E]:Can't load config for '../../../labse/saved_models/1'. Make sure that:
- '../../../labse/saved_models/1' is a correct model identifier listed on 'https://huggingface.co/models'
- or '../../../labse/saved_models/1' is the correct path to a directory containing a config.json file
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 355, in get_config_dict
local_files_only=local_files_only,
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/file_utils.py", line 730, in cached_path
raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file ../../../labse/saved_models/1/config.json not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/peapods/pea.py", line 291, in msg_callback
self.zmqlet.send_message(self._callback(msg))
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/peapods/pea.py", line 265, in _callback
self.pre_hook(msg).handle(msg).post_hook(msg)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/peapods/pea.py", line 159, in handle
self.executor(self.request_type)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/executors/__init__.py", line 568, in __call__
d()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/__init__.py", line 243, in __call__
self._traverse_apply(self.req.docs, *args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/__init__.py", line 248, in _traverse_apply
self._traverse_rec(docs, None, None, [], *args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/__init__.py", line 261, in _traverse_rec
self._apply_all(docs, parent_doc, parent_edge_type, *args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/encode.py", line 32, in _apply_all
embeds = self.exec_fn(contents)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/executors/decorators.py", line 167, in arg_wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/executors/decorators.py", line 60, in arg_wrapper
r = func(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/hub/encoders/nlp/TransformerTFEncoder/__init__.py", line 127, in encode
if self.tokenizer.pad_token is None:
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/helper.py", line 666, in __get__
value = obj.__dict__[f'CACHED_{self.func.__name__}'] = self.func(obj)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/hub/encoders/nlp/TransformerTFEncoder/__init__.py", line 116, in tokenizer
tokenizer = AutoTokenizer.from_pretrained(self.pretrained_model_name_or_path)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 216, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/configuration_auto.py", line 310, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 368, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for '../../../labse/saved_models/1'. Make sure that:
- '../../../labse/saved_models/1' is a correct model identifier listed on 'https://huggingface.co/models'
- or '../../../labse/saved_models/1' is the correct path to a directory containing a config.json file
As the model folder I showed earlier indicates, it doesn't contain a config.json file.
Thanks.
Hello @ghassen1302,
For the moment, you can use this Hugging Face model: https://huggingface.co/pvl/labse_bert. Setting pretrained_model_name_or_path: pvl/labse_bert should work for you.
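That is, adapting your earlier encode.yml (only the model identifier changes; from_tf should not be needed here since the weights come from the Hugging Face hub):

!TransformerTFEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: pvl/labse_bert
  max_length: 100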
Yes, I tried that and it worked. I wanted to use the one in TF Hub instead of Hugging Face. Is it possible?
Hi @ghassen1302, the TF Hub model files do not include the config.json file required by the transformers library. You can see some of the files by clicking "List all files in model" on the LaBSE page. Our transformer encoder is based on the transformers library, so it expects the same files.
If you are training your own models without the transformers library, you can convert them to the format transformers requires. You can find conversion scripts in the transformers library.
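For example, for a BERT-style TF 1.x checkpoint, the conversion looks roughly like this (a hedged sketch; the module path and file names depend on your transformers version and model type, and all three paths are hypothetical):

from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path='./tf_model/model.ckpt',                  # TF checkpoint prefix
    bert_config_file='./tf_model/bert_config.json',              # original BERT JSON config
    pytorch_dump_path='./my_model_directory/pytorch_model.bin',  # converted weights
)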
For USE you can use this
I was able to fix it, thanks. I created an encoder for LaBSE, LaBSE_encode.py:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import bert

from jina.executors.encoders.frameworks import BaseTFEncoder
from jina.executors.decorators import batching, as_ndarray


class LabseEncoder(BaseTFEncoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_seq_length = 200
        self.labse_model, self.labse_layer = self.get_model(
            model_url="https://tfhub.dev/google/LaBSE/1", max_seq_length=self.max_seq_length)
        self.vocab_file = self.labse_layer.resolved_object.vocab_file.asset_path.numpy()
        self.do_lower_case = self.labse_layer.resolved_object.do_lower_case.numpy()
        self.tokenizer = bert.bert_tokenization.FullTokenizer(self.vocab_file, self.do_lower_case)

    def get_model(self, model_url, max_seq_length):
        labse_layer = hub.KerasLayer(model_url, trainable=True)
        # Define input.
        input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                               name="input_word_ids")
        input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                           name="input_mask")
        segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                            name="segment_ids")
        # LaBSE layer.
        pooled_output, _ = labse_layer([input_word_ids, input_mask, segment_ids])
        # The embedding is l2 normalized.
        pooled_output = tf.keras.layers.Lambda(
            lambda x: tf.nn.l2_normalize(x, axis=1))(pooled_output)
        # Define model.
        return tf.keras.Model(
            inputs=[input_word_ids, input_mask, segment_ids],
            outputs=pooled_output), labse_layer

    def create_input(self, input_strings, tokenizer, max_seq_length):
        input_ids_all, input_mask_all, segment_ids_all = [], [], []
        for input_string in input_strings:
            # Tokenize input.
            input_tokens = ["[CLS]"] + tokenizer.tokenize(input_string) + ["[SEP]"]
            input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
            sequence_length = min(len(input_ids), max_seq_length)
            # Padding or truncation.
            if len(input_ids) >= max_seq_length:
                input_ids = input_ids[:max_seq_length]
            else:
                input_ids = input_ids + [0] * (max_seq_length - len(input_ids))
            input_mask = [1] * sequence_length + [0] * (max_seq_length - sequence_length)
            input_ids_all.append(input_ids)
            input_mask_all.append(input_mask)
            segment_ids_all.append([0] * max_seq_length)
        return np.array(input_ids_all), np.array(input_mask_all), np.array(segment_ids_all)

    @batching
    @as_ndarray
    def encode(self, input_text):
        input_ids, input_mask, segment_ids = self.create_input(
            input_text, self.tokenizer, self.max_seq_length)
        return np.array(self.labse_model([input_ids, input_mask, segment_ids]))
and I used it in my encode.yml:

!LabseEncoder
metas:
  py_modules: LaBSE_encode.py
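With that in place, the encoder plugs into a Flow like any other executor. A sketch (Flow method names follow the Jina 0.6.x-era API mentioned in this thread and may differ in later versions):

from jina.flow import Flow

# load the custom encoder defined above via its YAML config
f = Flow().add(uses='encode.yml')
with f:
    # input_fn takes any iterable of strings/bytes in this API version
    f.index(input_fn=['hello world', 'bonjour le monde'])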
Hey @ghassen1302,
Perfect, this is a great example of how you can easily extend Jina's functionality with your own custom executors and models! I hope you got what you wanted.
Yes, I did, thanks.
@ghassen1302 Glad you made it work. BTW, keep trainable=False, since the encoder is only used for inference.
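That is, in get_model above:

# keep the hub layer frozen; the weights are only read at inference time
labse_layer = hub.KerasLayer(model_url, trainable=False)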
I will change it, thanks.
I have a model saved using TensorFlow. This is the folder architecture.
I tried using TransformerTFEncoder, but I wasn't successful as I couldn't identify the model.ckpt.index file and the configuration object. I also tried UniversalSentenceEncoder, as the model exists in TF Hub, but I wasn't successful either as it isn't supported. What should I do?
I have another question: if I use TransformerTFEncoder with pretrained_model_name_or_path=<name of a pre-trained model> and from_tf=True, will it use the one from Hugging Face or the one from TF Hub?
I'm using Jina 0.6.6.
Thanks