huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Allow tensorflow tensors as input to Tokenizer #8495

Closed rbrthogan closed 3 years ago

rbrthogan commented 3 years ago

Firstly thanks so much for all the amazing work!

I'm trying to package a model for use in TF Serving. The problem is that everywhere I see this done, the tokenisation step happens outside of the server. I want to include this step inside the server so the user can just provide raw text as the input and not need to know anything about tokenization.

Here's how I'm trying to do it

def save_model(model, tokenizer, output_path):

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
    def serving(input_text):

        inputs = tokenizer(input_text, padding='longest', truncation=True, return_tensors="tf")
        outputs = model(inputs)
        logits = outputs[0]
        # keep everything as tensors; .numpy() is not available inside a tf.function
        probs = tf.nn.softmax(logits, axis=1)[:, 1]
        predictions = tf.cast(tf.math.round(probs), tf.int32)
        return {
            'classes': predictions,
            'probabilities': probs
        }

    print(f'Exporting model for TF Serving in {output_path}')
    tf.saved_model.save(model, export_dir=output_path, signatures=serving)

where e.g.

model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2', num_labels=num_classes)
tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')

The problem is that the tokenization step results in

AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Clearly it wants plain Python strings, not TensorFlow tensors.
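
For reference, a minimal sketch of what does work: decoding the string tensor back to Python strings before calling the tokenizer. This only runs eagerly, not inside the @tf.function traced for the SavedModel signature, which is exactly the problem.

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('albert-base-v2')
input_text = tf.constant(["first example", "second example"])

# Works in eager mode only: a symbolic tensor inside a traced tf.function
# has no .numpy() method, so this cannot go into the serving signature
texts = [t.decode("utf-8") for t in input_text.numpy()]
inputs = tokenizer(texts, padding='longest', truncation=True, return_tensors="tf")
print(inputs["input_ids"].shape)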

Would appreciate any help or workarounds, or, ideally of course, for this to be supported.


Running: transformers==3.4.0 tensorflow==2.3.0

LysandreJik commented 3 years ago

I believe @jplu has already used TF Serving. Do you know if it's possible to include tokenization in it?

jplu commented 3 years ago

Hello!

Unfortunately, it is currently not possible to integrate our tokenizer directly inside a model due to some TensorFlow limitations. Nevertheless, there might be a solution: create your own tokenization layer, such as the one the TF team is working on.
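
To give an idea of what such a layer could look like, here is a rough sketch (assuming tensorflow_text is installed and accepts a path to a wordpiece vocab file; it would not match ALBERT's sentencepiece tokenizer exactly):

import tensorflow as tf
import tensorflow_text as tf_text  # assumed installed; provides in-graph tokenization ops

class WordpieceTokenizerLayer(tf.keras.layers.Layer):
    """Sketch of an in-graph tokenization layer (wordpiece, not sentencepiece)."""

    def __init__(self, vocab_path, seq_len=128, **kwargs):
        super().__init__(**kwargs)
        self.seq_len = seq_len
        # vocab_path is a hypothetical path to a wordpiece vocab file
        self.tokenizer = tf_text.BertTokenizer(vocab_path, lower_case=True)

    def call(self, texts):
        # tokenize() returns a RaggedTensor of shape [batch, words, wordpieces]
        tokens = self.tokenizer.tokenize(texts)
        # flatten the word/wordpiece dimensions, then pad/truncate to seq_len
        tokens = tokens.merge_dims(-2, -1)
        return tokens.to_tensor(default_value=0, shape=[None, self.seq_len])

Because it is built from TensorFlow ops, a layer like this can run in graph mode and sit in front of the model inside the SavedModel.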

rbrthogan commented 3 years ago

Thanks for the response and for the link.

Yeah, it's a shame that there is still no way to use plain Python in the signature.

I'll likely just find a different workaround, e.g. converting to PyTorch and serving with TorchServe.

I'll close this for now.

maxzzze commented 3 years ago

I found a working solution that doesn't require any changes to TensorFlow or Transformers.

Commenting because I came across this while trying to do something similar. I actually think the issue here is not TensorFlow but the Transformers type checking on the tokenizer call, which doesn't allow TensorFlow objects.

I made the following implementation, which appears to be working and doesn't run into the TensorFlow limitations mentioned above:

import tensorflow as tf
import transformers

# NOTE: the specific model class here will need to be overwritten because AutoModel doesn't work
class CustomModel(transformers.TFDistilBertForSequenceClassification):

    def call_tokenizer(self, input):
        # The tokenizer expects plain Python strings, so cast before calling it.
        if isinstance(input, list):
            return self.tokenizer([str(x) for x in input], return_tensors='tf')
        else:
            return self.tokenizer(str(input), return_tensors='tf')

    @tf.function(input_signature=[tf.TensorSpec(shape=(1,), dtype=tf.string)])
    def serving(self, content):
        # `content` is a string tensor of shape (1,)
        batch = self.call_tokenizer(content)
        # Pass the tokenizer output as the first positional input to call()
        batch = dict(batch)
        batch = [batch]
        output = self.call(batch)
        return self.serving_output(output)


tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_path,
    use_fast=True
)

config = transformers.AutoConfig.from_pretrained(
    model_path,
    num_labels=2,
    from_pt=True
)

model = CustomModel.from_pretrained(
    model_path,
    config=config,
    from_pt=True
)

model.tokenizer = tokenizer
model.id2label = config.id2label
model.save_pretrained("model", saved_model=True)
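
If it helps, here is a quick way to sanity-check the exported signature. The model/saved_model/1 path is what save_pretrained(..., saved_model=True) produces by default; the signature key and the content keyword are my assumptions based on the serving definition above.

import tensorflow as tf

# Load the SavedModel exported by save_pretrained(..., saved_model=True)
loaded = tf.saved_model.load("model/saved_model/1")
serve = loaded.signatures["serving_default"]

# The serving signature expects a string tensor of shape (1,)
print(serve(content=tf.constant(["some example text"])))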

bluekidds commented 3 years ago

Hi @maxzzze

I was also working on including the HF tokenizer inside a TF model. However, I found that inside call_tokenizer, the results the tokenizer returns are always the same, regardless of the text input you pass in.

Have you also encountered this issue? I'm thinking save_pretrained isn't including the tokenizer appropriately.
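
One possible explanation, just a guess on my side: inside a @tf.function that gets traced for saving, str() on the input tensor returns the symbolic tensor's repr rather than the actual text, so the tokenizer output becomes a constant baked in at trace time. A quick way to see this:

import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=(1,), dtype=tf.string)])
def show_str(content):
    # str() is evaluated once at trace time on the symbolic tensor,
    # so the same value is printed no matter what is passed in
    tf.print(str(content))
    return content

show_str(tf.constant(["hello"]))  # prints something like: Tensor("content:0", shape=(1,), dtype=string)
show_str(tf.constant(["world"]))  # prints the same repr again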
