kpe / bert-for-tf2

A Keras TensorFlow 2.0 implementation of BERT, ALBERT and adapter-BERT.
https://github.com/kpe/bert-for-tf2
MIT License
803 stars 193 forks

Custom tokenizer layer #75

Closed ptamas88 closed 3 years ago

ptamas88 commented 3 years ago

Hi, I would like to incorporate the tokenization process into a model that uses a BERT layer. Here is my custom layer:

class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, max_length, **kwargs):
        self.vocab_path = vocab_path
        self.length = max_length
        self.tokenizer = bert.bert_tokenization.FullTokenizer(vocab_path, do_lower_case=False)
        super(TokenizationLayer, self).__init__(**kwargs)

    def call(self,inputs):
        tokens = self.tokenizer.tokenize(inputs)
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        ids += [self.tokenizer.vocab['[PAD]']] * (self.length-len(ids))
        return ids

And here is my code to test the custom layer within a dummy model:

inputs = tf.keras.layers.Input(shape=(), dtype='string')
tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

I get the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-8df4885e5c7a> in <module>
      1 inputs = tf.keras.layers.Input(shape=(), dtype='string')
      2 tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
----> 3 outputs = tokenization_layer(inputs)
      4 model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    924     if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
    925       return self._functional_construction_call(inputs, args, kwargs,
--> 926                                                 input_list)
    927 
    928     # Maintains info about the `Layer.call` stack.

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
   1115           try:
   1116             with ops.enable_auto_cast_variables(self._compute_dtype_object):
-> 1117               outputs = call_fn(cast_inputs, *args, **kwargs)
   1118 
   1119           except errors.OperatorNotAllowedInGraphError as e:

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    256       except Exception as e:  # pylint:disable=broad-except
    257         if hasattr(e, 'ag_error_metadata'):
--> 258           raise e.ag_error_metadata.to_exception(e)
    259         else:
    260           raise

ValueError: in user code:

    <ipython-input-60-d6c12f7d1b14>:17 call  *
        tokens = self.tokenizer.tokenize(inputs)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:172 tokenize  *
        for token in self.basic_tokenizer.tokenize(text):
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:198 tokenize  *
        text = convert_to_unicode(text)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:86 convert_to_unicode  *
        raise ValueError("Unsupported string type: %s" % (type(text)))

    ValueError: Unsupported string type: <class 'tensorflow.python.framework.ops.Tensor'>

Can you please help me solve this issue? I think the problem is that the tokenizer receives tensors rather than Python strings, which is why it can't tokenize them. But if that is the case, how should I make this work? Thanks

Shiro-LK commented 3 years ago

@ptamas88 Did you manage to make it work? I have the same question.

kpe commented 3 years ago

Yes, usually the tokenizer is not part of the graph. To put it in the graph you'll need a tokenizer that has a TF implementation, like SentencePiece when using ALBERT. For BERT you might try the tf.text BertTokenizer (https://github.com/tensorflow/text/blob/master/docs/api_docs/python/text/BertTokenizer.md). I haven't used it myself, but it should work.
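
Since the tokenizer is usually kept outside the graph, one option is to tokenize in plain Python before the data reaches the model. Below is a minimal sketch of that preprocessing step. The toy vocabulary and the whitespace `toy_tokenize` function are hypothetical stand-ins for a real `vocab.txt` and `FullTokenizer`, used only so the example runs on its own.

```python
# Sketch: tokenize as a preprocessing step, outside the Keras model.
# TOY_VOCAB and toy_tokenize are hypothetical stand-ins for a real
# vocab.txt and bert.bert_tokenization.FullTokenizer.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "cruel": 3, "world": 4}


def toy_tokenize(sentence):
    """Split on whitespace (a real WordPiece tokenizer does much more)."""
    return sentence.split()


def encode(sentence, max_length, vocab=TOY_VOCAB):
    """Turn one Python string into a fixed-length list of token ids."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in toy_tokenize(sentence)]
    ids = ids[:max_length]  # truncate inputs that are too long
    ids += [vocab["[PAD]"]] * (max_length - len(ids))  # pad short inputs
    return ids


def encode_batch(sentences, max_length):
    """Encode a list of Python strings into equal-length id lists."""
    return [encode(s, max_length) for s in sentences]


batch_ids = encode_batch(["hello cruel world", "hello there"], max_length=5)
```

With this approach the model's input would be an integer tensor of shape (batch, max_length) rather than a string tensor, so the tokenizer never sees a `Tensor` object.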

kpe commented 3 years ago

Hope this helps:

pip install tensorflow-text

and then try something along these lines:

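import os
import tensorflow_text as text

# ckpt_dir and max_seq_len are assumed to be defined by the caller
tokenizer = text.BertTokenizer(os.path.join(ckpt_dir, 'vocab.txt'))
# flatten [batch, words, wordpieces] to [batch, tokens], then pad to a dense tensor
tok_ids = tokenizer.tokenize(["hello, cruel world!", "abcccccccd"]).merge_dims(-2,-1).to_tensor(shape=(2, max_seq_len))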
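
To make the shape handling in that one-liner explicit, here is a pure-Python illustration of what the two chained calls do, as far as I understand them: `merge_dims(-2, -1)` flattens the per-word wordpiece lists into one token list per sentence, and `to_tensor(shape=...)` pads with zeros (its default fill value) or trims each row to a fixed width. The helper names and the sample ids are made up for illustration; this is not how TensorFlow implements it.

```python
# Illustration only: plain lists standing in for a RaggedTensor of shape
# [batch, words, wordpieces]. The ids below are made up.
ragged = [
    [[7592], [1010], [10311], [2088], [999]],  # five words, one piece each
    [[5925], [9468, 9468], [2094]],            # middle word split in two pieces
]


def merge_last_two_dims(batch):
    """Mimic merge_dims(-2, -1): [batch, words, pieces] -> [batch, tokens]."""
    return [[piece for word in sentence for piece in word] for sentence in batch]


def to_dense(batch, width, pad_value=0):
    """Mimic to_tensor(shape=(len(batch), width)): pad or truncate each row."""
    return [(row + [pad_value] * width)[:width] for row in batch]


flat = merge_last_two_dims(ragged)
dense = to_dense(flat, width=6)
```

Every row of `dense` has the same length, which is what a BERT layer expecting a fixed `max_seq_len` needs.
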

ptamas88 commented 3 years ago

> @ptamas88 Did you manage to make it work? I have the same question.

I haven't tried since, but I will check out the solution @kpe mentioned.

keeson commented 3 years ago

It didn't work; it still throws OperatorNotAllowedInGraphError.