huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

HF TF models can be used with TFX #16447

Closed gante closed 1 year ago

gante commented 2 years ago

🚀 Feature request

Hugging Face TensorFlow models can be used with TensorFlow Extended.

Motivation

TensorFlow Extended goes beyond TensorFlow with respect to production-ready capabilities. Ensuring our models can be used with it just like TensorFlow Hub models would open new possibilities for TF/TFX users.

This issue will be used to discuss, plan, and track work related to the goal of making HF TF models TFX-compatible.

jinnovation commented 2 years ago

Sounds exciting. I'm curious to hear if you all at HF plan to collaborate with @rcrowe-google and the TFX folks on this and, if so, in what capacity.

rcrowe-google commented 2 years ago

Yup. We're just starting to meet now, with a goal of strong alignment.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

gante commented 2 years ago

Mr. bot, certainly not stale ;)

rclough commented 2 years ago

Hey there, I've done some work on investigating HF usage within my own organization and thought I'd braindump some of my findings on the topic so far in case you'd find it helpful.

Hopefully folks from the TF(X) teams can tell me if I'm off base with anything I describe here, or otherwise confirm this analysis.

The problem of supporting HF within TFX breaks down into a few key issues that are somewhat intertwined - tokenizers and model formats.

Model Formats

HF Transformers supports models in PyTorch, TensorFlow, and Flax (see the "Supported frameworks" chart). TFX is primarily designed with TensorFlow in mind, so to keep the scope manageable I'd focus on supporting TF models.

TFX can support other frameworks, however clunkily; examples exist for training things like scikit-learn, XGBoost, and Flax. At the time of writing, though, that support is limited to training and, to some degree, evaluation (via a custom extractor). Support is missing for components like BulkInferrer, which assumes a TF model, although perhaps that could be addressed with functionality similar to the evaluator's.

In either case, one will need to use a custom image for TFX that includes the extra dependencies for Hugging Face (and potentially PyTorch, if one wishes to use a PyTorch model).

Tokenization

This is more the crux of the issue as I see it. In TF(X), it is common to try to include data preprocessing as part of the model graph, so that the model can predict on raw data without relying on tokenization happening on the client side or being implemented separately in the serving layer. This is commonly tackled with either TensorFlow Transform (TFT) or Keras Preprocessing Layers (KPL). For TF Hub models, the preprocessing code is usually added to one of those pieces, or something from TF Text is used.
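
For reference, the pattern usually looks something like this rough sketch (the vocab path and sequence length are placeholders, and none of it is TFX-specific):

import tensorflow as tf
import tensorflow_text as tf_text

# Rough sketch of in-graph preprocessing: a TF Text tokenizer wrapped in a Keras layer.
class BertTokenizeLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path="vocab.txt", seq_len=128, **kwargs):
        super().__init__(**kwargs)
        self.tokenizer = tf_text.BertTokenizer(vocab_path, lower_case=True)
        self.seq_len = seq_len

    def call(self, texts):
        # [batch, words, wordpieces] -> [batch, wordpieces], then pad/truncate.
        tokens = self.tokenizer.tokenize(texts).merge_dims(-2, -1)
        return tokens.to_tensor(default_value=0, shape=[None, self.seq_len])

# Because tokenization is part of the graph, the exported SavedModel accepts raw
# strings and can be served by TF Serving without a client-side tokenizer.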

For Hugging Face, it's harder to see how this fits into the picture, but not impossible. In the most optimistic scenario, the slow (pure-Python) tokenizers might be annotated with @tf.function and used in TFT or KPL, and thus included in the graph (I have not attempted this). Some related docs here, at least for the case of TF-based HF models. If this is possible, then HF should fit somewhat neatly into the rest of the TFX ecosystem when using a TF model.
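
Something like this untried sketch is what I have in mind; note it has to fall back to tf.py_function (a plain @tf.function can't trace the pure-Python tokenizer into graph ops), which keeps a Python dependency and therefore wouldn't survive into a served SavedModel:

import tensorflow as tf
from transformers import BertTokenizer

# Untried sketch: wrap a slow (pure-Python) HF tokenizer with tf.py_function so it
# can run inside a tf.data pipeline. Model name and max length are placeholders.
hf_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 128

def _tokenize_py(text):
    enc = hf_tokenizer(text.numpy().decode("utf-8"),
                       padding="max_length", truncation=True, max_length=MAX_LEN)
    return enc["input_ids"], enc["attention_mask"]

def tokenize(text):
    input_ids, attention_mask = tf.py_function(
        _tokenize_py, [text], Tout=[tf.int32, tf.int32])
    input_ids.set_shape([MAX_LEN])
    attention_mask.set_shape([MAX_LEN])
    return {"input_ids": input_ids, "attention_mask": attention_mask}

ds = tf.data.Dataset.from_tensor_slices(["hello world", "tfx and transformers"]).map(tokenize)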

If this is not possible, or one needs to use the faster Rust-based tokenizers (see the "Supported frameworks" chart linked above), then either work needs to be done to make them usable within a TensorFlow graph (far beyond my knowledge), or there will need to be a TFX component that takes TFRecords, tokenizes them, and outputs tokenized TFRecords. That component would ALSO have to somehow indicate which tokenizer to use in the Eval process (which would require a custom extractor whether you're using TF or another framework), and potentially in a batch-prediction component like BulkInferrer as well (or perhaps the aforementioned tokenizer component could be chained in front of it).
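
As a very rough, non-TFX illustration of the "tokenize TFRecords into TFRecords" idea (the feature names are placeholders, and a real component would presumably use Apache Beam rather than a plain loop):

import tensorflow as tf

def tokenize_tfrecords(in_path, out_path, tokenize_fn, text_key="text"):
    # tokenize_fn is e.g. an HF tokenizer's encode(); text_key is a placeholder.
    with tf.io.TFRecordWriter(out_path) as writer:
        for raw in tf.data.TFRecordDataset(in_path):
            example = tf.train.Example.FromString(raw.numpy())
            text = example.features.feature[text_key].bytes_list.value[0].decode("utf-8")
            input_ids = tokenize_fn(text)
            out = tf.train.Example(features=tf.train.Features(feature={
                "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
            }))
            writer.write(out.SerializeToString())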

This also doesn't consider what needs to happen if you want to serve the model online, where HF's current example seems to imply tokenization on the client side, or at least a Python-based intermediary service, as opposed to using TF Serving directly (or the PyTorch equivalents; I'm less knowledgeable there). If tokenization can be included in the model graph, then this could at least be avoided for TF models.

gante commented 2 years ago

Hi @rclough 👋 Thank you for your notes, they are very helpful! Curiously, today we also talked internally about the tokenizers and their interoperability with downstream TF Graphs (the model) -- in a perfect world, tokenizer + model would go in a single serializable graph. We may have news soon, stay tuned!

cc @Rocketknight1

rclough commented 2 years ago

Great to hear! I would love to see that happen!

I also forgot to mention a third approach to tokenization - IIRC it may be possible to use the metadata from some of HF's tokenizers to instantiate TF Text tokenizers with the same implementation. For example, on issue #5066, one user described a way to use HF tokenization with the TF Text SentencePiece implementation, gist here. Note I have not tried this either.

NusretOzates commented 2 years ago

> Great to hear! I would love to see that happen!
>
> I also forgot to mention a 3rd approach to tokenization - IIRC it may be possible to use the metadata for some of HF's tokenizers to instantiate TF Text tokenizers with the same implementation, for example on issue #5066, one user described a way to use HF tokenization with the TF Text SentencePiece implementation, gist here. Note I have not tried this either.

I tried this and it works! The only thing is that you shouldn't use the "Fast" implementation. I replaced AutoTokenizer with AlbertTokenizer and it worked. Also, you need to install sentencepiece with "pip install sentencepiece".
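
For anyone else trying this, here is a condensed sketch of the approach; the model name is just an example, and it assumes the slow tokenizer exposes the path to its SentencePiece .model file as vocab_file:

import tensorflow as tf
import tensorflow_text as tf_text
from transformers import AlbertTokenizer

# Sketch of the gist's approach: reuse the SentencePiece model behind a slow
# (non-Fast) HF tokenizer to build an in-graph TF Text tokenizer.
hf_tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")

with open(hf_tokenizer.vocab_file, "rb") as f:  # path to the underlying .model file
    sp_model = f.read()

tf_tokenizer = tf_text.SentencepieceTokenizer(model=sp_model, out_type=tf.int32)
# tf_tokenizer.tokenize(...) now runs as ordinary TF ops and can live inside a graph.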

rcrowe-google commented 2 years ago

That's great news, thanks Nusret! Have you written any documentation or examples for this yet?

NusretOzates commented 2 years ago

I've tried the example in that gist, and now I've created testable code here. Btw, I'm currently taking the MLOps Specialization on Coursera and the lessons are great, thanks for it!

rcrowe-google commented 2 years ago

Interesting that the shape is different, do you foresee any problems with that?

tf_tokenizer.tokenize(tf.strings.lower("merhaba"))
<tf.Tensor: shape=(3,), dtype=int32, numpy=array([   55, 16849,   969], dtype=int32)>

hf_tokenizer.encode("merhaba", add_special_tokens=False, return_tensors ="tf")
<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[   55, 16849,   969]], dtype=int32)>

NusretOzates commented 2 years ago

Probably the reason is that the HF tokenizer encodes the input as a batch of strings, so a tf.reshape operation can fix the difference. But now I wonder how we can create the other outputs. Normally we use:

# I generally don't use the return_tensors parameter
hf_tokenizer(["hi", "this is me"], add_special_tokens=False, return_tensors ="tf", padding=True, truncation=True)

and get the output:

{
'input_ids': <tf.Tensor: shape=(2, 3), dtype=int32, numpy= array([[4148,    0,    0],  [  48,   25,   55]], dtype=int32)>, 
'token_type_ids': <tf.Tensor: shape=(2, 3), dtype=int32, numpy= array([[0, 0, 0],  [0, 0, 0]], dtype=int32)>, 
'attention_mask': <tf.Tensor: shape=(2, 3), dtype=int32, numpy=array([[1, 0, 0], [1, 1, 1]], dtype=int32)>
}

It seems like there is no easy way to use the Tokenizers with TensorFlow. I can only think of looking at the tokenizer source code and writing a TensorFlow version of it to create these outputs and add the tokenizer's other features, so that a graph could be created and used as @rclough mentioned. Maybe a tokenization layer built by subclassing the base Keras Layer class could help.
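
For illustration, a very rough, untested sketch of one piece of such a layer (assuming the TF Text SentencePiece tokenizer from the example above, ignoring special tokens, and assuming pad id 0):

import tensorflow as tf

def encode(texts, max_len=128):
    # texts: a 1-D string tensor; tf_tokenizer: a TF Text tokenizer as built above.
    ragged = tf_tokenizer.tokenize(tf.strings.lower(texts))
    input_ids = ragged.to_tensor(default_value=0, shape=[None, max_len])
    attention_mask = tf.cast(tf.sequence_mask(ragged.row_lengths(), max_len), tf.int32)
    token_type_ids = tf.zeros_like(input_ids)  # single-segment inputs only
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids}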

gante commented 2 years ago

@NusretOzates we are precisely working on that [native TF tokenizers for transformers] at the moment -- see https://github.com/huggingface/transformers/pull/17701
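
Once it lands, the intended usage should look roughly like this (the API is still in flight, so treat it as a sketch):

import tensorflow as tf
from transformers import TFBertTokenizer, TFBertModel

# Provisional sketch of the end-to-end usage that PR enables; exact class and
# argument names may still change.
tokenizer = TFBertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

text_in = tf.keras.Input(shape=(), dtype=tf.string)
tokenized = tokenizer(text_in)          # TF Text ops, so tokenization stays in the graph
outputs = model(tokenized)
end_to_end = tf.keras.Model(text_in, outputs.pooler_output)
# end_to_end can be exported as a SavedModel that accepts raw strings, which is
# what TF Serving (and downstream TFX components) want.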

NusretOzates commented 2 years ago

@gante I just checked the code and tests to see the usage and it looks great! Thanks a lot for the effort!

rclough commented 2 years ago

Nice to see progress getting BERT tokenization available for TFX.

On a side note, I found this interesting repo that converts a number of huggingface models (including tokenization) to TF Hub: https://github.com/jeongukjae/huggingface-to-tfhub

Rocketknight1 commented 2 years ago

@rclough Wow, it looks like they did a lot of work on precisely reimplementing tokenizers in TF there, that's extremely interesting!

rclough commented 2 years ago

@Rocketknight1 Definitely! I'm working with a team that's using their DistilBERT port, which they found through TF Hub (after first doing an MVP with HF), and I discovered the repo through that. The implementation seems pretty high quality to me, mostly just reusing the vocab files and aligning the configuration with how TF Text does things.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.