jina-ai / examples

Jina examples and demos to help you get started
https://docs.jina.ai
Apache License 2.0

TF model #251

Closed ghassen1302 closed 3 years ago

ghassen1302 commented 3 years ago

I have a model saved using TensorFlow. This is the folder structure:

.
└── model
    ├── assets
    ├── variables
    │   ├── variables.data-00000-of-00001
    │   └── variables.index
    └── saved_model.pb

I tried using TransformerTFEncoder, but I wasn't successful, as I couldn't identify the model.ckpt.index file and the configuration object. I also tried UniversalSentenceEncoder, since the model exists on TF Hub, but I wasn't successful either, as it isn't supported. What should I do?

I have another question. If I use TransformerTFEncoder with pretrained_model_name_or_path=<name of a pre-trained model> and from_tf=True, will it use the model from Hugging Face or the one from TF Hub?

I'm using Jina 0.6.6. Thanks.

JoanFM commented 3 years ago

Hello @ghassen1302,

First of all, how have you tried to use TransformerTFEncoder or UniversalSentenceEncoder, and why do you say that the latter is not supported?

TransformerTFEncoder will use the Transformers API to call the TFAutoModelForPreTraining.from_pretrained method. https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModelForPreTraining
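
For illustration, a minimal sketch of what that call amounts to (the bert-base-uncased shortcut name here is just an example, not anything from this thread):

from transformers import TFAutoModelForPreTraining

# Downloads (or loads from the local cache) both the config and the TF weights
# for the given shortcut name.
model = TFAutoModelForPreTraining.from_pretrained('bert-base-uncased')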

I hope this helps

ghassen1302 commented 3 years ago

For UniversalSentenceEncoder, the model (LaBSE) isn't supported, as it's not in the universal sentence encoder collection (https://tfhub.dev/google/collections/universal-sentence-encoder/1) specified in the documentation.

For TransformerTFEncoder, this is my encode.yml:

!TransformerTFEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: <model.ckpt.index path>
  max_length: 100
  from_tf: True

I'm a bit confused about which files are the model.ckpt.index file and the configuration object. I will check the Hugging Face documentation. Thanks.

JoanFM commented 3 years ago

I have seen the TransformerTFEncoder documentation, and it seems misleading, since there is actually no from_tf parameter. As per the documentation of the params, it says:

:param pretrained_model_name_or_path: Either:
            - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``.
            - a string with the `identifier name` of a pre-trained model that was user-uploaded to Hugging Face S3, e.g.: ``dbmdz/bert-base-german-cased``.
            - a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``.
            - a path or url to a `tensorflow index checkpoint file` (e.g. `./tf_model/model.ckpt.index`). In this case, ``from_tf`` should be set to True and a configuration object should be provided as ``config`` argument.       

So I would say it may expect the root directory path of your saved model.
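
For the checkpoint case described in that docstring, a hedged sketch of the expected call (the BertConfig/BertModel choice and the bert_config.json path are illustrative, not taken from this thread; the checkpoint path is the docstring's own example):

from transformers import BertConfig, BertModel

# A raw TF index checkpoint needs an explicit configuration object.
config = BertConfig.from_json_file('./tf_model/bert_config.json')  # hypothetical config path
model = BertModel.from_pretrained('./tf_model/model.ckpt.index', from_tf=True, config=config)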

How have you generated this trained model?

ghassen1302 commented 3 years ago

No, I didn't generate it myself. I just have a model which I downloaded from TF Hub that I want to use as an encoder.

JoanFM commented 3 years ago

Can you please share the logs you observe when trying to load the encoder with the configuration you sent earlier?

ghassen1302 commented 3 years ago

For <model.ckpt.index path>, should I use the path of variables.data-00000-of-00001, variables.index, or saved_model.pb?

JoanFM commented 3 years ago

I think you should use the base path of all these files, but I will try to reproduce this tomorrow if you can guide me on where to get this model or a similar one.

ghassen1302 commented 3 years ago

This is the model: https://tfhub.dev/google/LaBSE/1. These are the logs:

 encoder1@29752[E]:Can't load config for '../../../labse/saved_models/1'. Make sure that:

- '../../../labse/saved_models/1' is a correct model identifier listed on 'https://huggingface.co/models'

- or '../../../labse/saved_models/1' is the correct path to a directory containing a config.json file

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 355, in get_config_dict
    local_files_only=local_files_only,
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/file_utils.py", line 730, in cached_path
    raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file ../../../labse/saved_models/1/config.json not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/peapods/pea.py", line 291, in msg_callback
    self.zmqlet.send_message(self._callback(msg))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/peapods/pea.py", line 265, in _callback
    self.pre_hook(msg).handle(msg).post_hook(msg)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/peapods/pea.py", line 159, in handle
    self.executor(self.request_type)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/executors/__init__.py", line 568, in __call__
    d()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/__init__.py", line 243, in __call__
    self._traverse_apply(self.req.docs, *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/__init__.py", line 248, in _traverse_apply
    self._traverse_rec(docs, None, None, [], *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/__init__.py", line 261, in _traverse_rec
    self._apply_all(docs, parent_doc, parent_edge_type, *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/drivers/encode.py", line 32, in _apply_all
    embeds = self.exec_fn(contents)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/executors/decorators.py", line 167, in arg_wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/executors/decorators.py", line 60, in arg_wrapper
    r = func(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/hub/encoders/nlp/TransformerTFEncoder/__init__.py", line 127, in encode
    if self.tokenizer.pad_token is None:
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/helper.py", line 666, in __get__
    value = obj.__dict__[f'CACHED_{self.func.__name__}'] = self.func(obj)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/jina/hub/encoders/nlp/TransformerTFEncoder/__init__.py", line 116, in tokenizer
    tokenizer = AutoTokenizer.from_pretrained(self.pretrained_model_name_or_path)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 216, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/configuration_auto.py", line 310, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 368, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for '../../../labse/saved_models/1'. Make sure that:

- '../../../labse/saved_models/1' is a correct model identifier listed on 'https://huggingface.co/models'

- or '../../../labse/saved_models/1' is the correct path to a directory containing a config.json file

As I showed my model folder earlier, it doesn't contain a config.json file. Thanks.

JoanFM commented 3 years ago

Hello @ghassen1302 ,

For the moment, you can use this Hugging Face model: https://huggingface.co/pvl/labse_bert. Using pretrained_model_name_or_path = pvl/labse_bert should work for you.
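
For example, reusing the encode.yml from above with that identifier (taking pvl/labse_bert from the URL) would look roughly like:

!TransformerTFEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: pvl/labse_bert
  max_length: 100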

ghassen1302 commented 3 years ago

Yes, I tried that and it worked. I wanted to use the one in TF Hub instead of Hugging Face. Is it possible?

bhavsarpratik commented 3 years ago

Hi @ghassen1302, the TF Hub model files do not include the config.json file required by the transformers library. You can see some of the files by clicking List all files in model on LaBSE. Our transformer encoder is based on the transformers library, so it also expects the same files.

If you are training your own models without using the transformers library, you can convert them to the format required by transformers. You can find conversion scripts in the transformers library.
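
As a sketch of the target format, a model that is already in the transformers library can be saved with save_pretrained, which writes the config.json and weight files the library expects (the model name and directory here are illustrative):

from transformers import TFBertModel, BertTokenizer

model = TFBertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Writes config.json and tf_model.h5 into the directory.
model.save_pretrained('./my_model_directory')
# Writes vocab.txt and the tokenizer config alongside.
tokenizer.save_pretrained('./my_model_directory')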

For USE you can use this

ghassen1302 commented 3 years ago

I was able to fix it, thanks. I created an encoder for LaBSE, LaBSE_encode.py:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import bert
from jina.executors.encoders.frameworks import BaseTFEncoder
from jina.executors.decorators import batching, as_ndarray


class LabseEncoder(BaseTFEncoder):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.max_seq_length = 200
        self.labse_model, self.labse_layer = self.get_model(
            model_url="https://tfhub.dev/google/LaBSE/1", max_seq_length=self.max_seq_length)

        self.vocab_file = self.labse_layer.resolved_object.vocab_file.asset_path.numpy()
        self.do_lower_case = self.labse_layer.resolved_object.do_lower_case.numpy()
        self.tokenizer = bert.bert_tokenization.FullTokenizer(self.vocab_file, self.do_lower_case)

    def get_model(self, model_url, max_seq_length):
        labse_layer = hub.KerasLayer(model_url, trainable=True)

        # Define inputs.
        input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                               name="input_word_ids")
        input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                           name="input_mask")
        segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
                                            name="segment_ids")

        # LaBSE layer.
        pooled_output, _ = labse_layer([input_word_ids, input_mask, segment_ids])

        # The embedding is l2 normalized.
        pooled_output = tf.keras.layers.Lambda(
            lambda x: tf.nn.l2_normalize(x, axis=1))(pooled_output)

        # Define the model.
        return tf.keras.Model(
            inputs=[input_word_ids, input_mask, segment_ids],
            outputs=pooled_output), labse_layer

    def create_input(self, input_strings, tokenizer, max_seq_length):
        input_ids_all, input_mask_all, segment_ids_all = [], [], []
        for input_string in input_strings:
            # Tokenize input and add the BERT special tokens.
            input_tokens = ["[CLS]"] + tokenizer.tokenize(input_string) + ["[SEP]"]
            input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
            sequence_length = min(len(input_ids), max_seq_length)

            # Padding or truncation to max_seq_length.
            if len(input_ids) >= max_seq_length:
                input_ids = input_ids[:max_seq_length]
            else:
                input_ids = input_ids + [0] * (max_seq_length - len(input_ids))

            input_mask = [1] * sequence_length + [0] * (max_seq_length - sequence_length)

            input_ids_all.append(input_ids)
            input_mask_all.append(input_mask)
            segment_ids_all.append([0] * max_seq_length)

        return np.array(input_ids_all), np.array(input_mask_all), np.array(segment_ids_all)

    @batching
    @as_ndarray
    def encode(self, input_text):
        input_ids, input_mask, segment_ids = self.create_input(
            input_text, self.tokenizer, self.max_seq_length)
        return np.array(self.labse_model([input_ids, input_mask, segment_ids]))

and I used it in my encode.yml:

!LabseEncoder
metas:
  py_modules: LaBSE_encode.py
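
A quick standalone sanity check of the encoder outside of a Flow could look like the hypothetical snippet below (the 768-dimensional output assumes LaBSE's usual embedding size):

from LaBSE_encode import LabseEncoder

encoder = LabseEncoder()
embeddings = encoder.encode(['Hello world', 'Bonjour le monde'])
print(embeddings.shape)  # expected: (2, 768)
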
JoanFM commented 3 years ago

Hey @ghassen1302,

Perfect, this is a great example of how you can easily extend Jina's functionality with your own custom executors and models! I hope you got what you wanted, then.

ghassen1302 commented 3 years ago

Yes, I did, thanks.

bhavsarpratik commented 3 years ago

@ghassen1302 Glad you made it work. BTW, keep trainable=False.
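
That is, a one-line change in get_model above:

# Inference only: the LaBSE weights should not be trainable.
labse_layer = hub.KerasLayer(model_url, trainable=False)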

ghassen1302 commented 3 years ago

I will change it, thanks.