tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

[FEATURE REQUEST] Support for python preprocessing #663

Open lminer opened 6 years ago

lminer commented 6 years ago

We have to do a fair amount of text preprocessing of our data before feeding it into TensorFlow. Since the text manipulation abilities of TensorFlow and TensorFlow Transform are still relatively immature, it is impossible for us to incorporate this preprocessing into the graph.

It would be super useful if there were some way to pass a python function in to perform preprocessing on the data before sending it to tensorflow. Maybe via tf.py_func?
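For concreteness, here's a rough sketch of the kind of hook I have in mind (`clean` is just a stand-in for our real tokenizer):

```python
import numpy as np
import tensorflow as tf

def clean(texts):
    # stand-in for the real Python tokenizer/lowercaser
    return np.array([t.decode('utf-8').lower().encode('utf-8') for t in texts],
                    dtype=object)

raw = tf.placeholder(tf.string, [None])
# tf.py_func wraps the Python callable as a graph op, but the callable
# itself is not serialized into the GraphDef, so ModelServer cannot run it.
processed = tf.py_func(clean, [raw], tf.string)
```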

This is related to this issue.

zacharynevin commented 6 years ago

You can write a custom ModelServer in C++: https://www.tensorflow.org/serving/custom_servable

lminer commented 6 years ago

I could do that or I could also write a custom op in tensorflow, but it would be nice if I could stay out of C++ for two reasons.

First, the preprocessing code for training is all in Python, and it would be a pain to have to maintain two pieces of code in two separate languages that do the same thing. Second, the company I work at is not a C++ shop and doesn't want to take on the maintenance burden of supporting C++ code.

The only feasible option for us is to put the preprocessing code in the client, but this doesn't feel like the ideal solution. Preprocessing is tightly coupled with the model itself; I'd even argue it is part of the model (à la sklearn pipelines). With the client calling several different models with different preprocessing pipelines, we're essentially spreading a single model's code across two repos.

zacharynevin commented 6 years ago

What kind of preprocessing are you doing?

I've done processing steps directly in TensorFlow graphs for things like post-processing of YoloV2 bounding boxes (https://pjreddie.com/darknet/yolo/). I realize that's not quite text processing, but it's still a fairly non-trivial task.

If you can describe what you are trying to do, maybe we can give pointers for how to do that directly in Tensorflow, if possible.

lminer commented 6 years ago

It's text preprocessing, in particular tokenizing and lowercasing. I looked at the TensorFlow string functions as well as tf.transform, and none of them seem to have the capabilities I need. The tokenizing is fairly specialized.

kirilg commented 6 years ago

The simplest option would be to use existing TF or tf.transform ops to accomplish what you want. If not directly (sounds like you explored this but couldn't find the right ops), then by stringing together existing ops to achieve what you want (e.g. this workaround that uses a lookup table to lowercase strings). It may not be ideal, but if it works, it requires no serving changes.
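Roughly, that lookup-table workaround looks like the following TF 1.x sketch (ASCII only; a served graph would also need the table initialized, e.g. via tf.tables_initializer()). Newer TF releases also added tf.strings.lower, which covers the lowercasing part directly.

```python
import string
import tensorflow as tf

def lower_ascii(s):
    """Lowercase a scalar ASCII string using only graph ops (sketch)."""
    chars = tf.string_split([s], delimiter='').values
    upper = tf.constant(list(string.ascii_uppercase))
    lower = tf.constant(list(string.ascii_lowercase))
    table = tf.contrib.lookup.index_table_from_tensor(upper, default_value=-1)
    idx = table.lookup(chars)
    # substitute the lowercase letter where an uppercase one was found,
    # otherwise keep the original character
    lowered = tf.where(idx >= 0, tf.gather(lower, tf.maximum(idx, 0)), chars)
    return tf.reduce_join(lowered)
```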

If your preprocessing is generic enough and can be useful for others, you can propose feature requests to the TF or tf.transform repos with concrete suggestions and see if they can implement them or accept contributions.

If neither of the two above are feasible, then I think custom ops is the way to go since it's custom to your data. You'll unfortunately have to maintain C++ code and compile a custom ModelServer, but hopefully that's temporary until TF and tf.transform can do what you want with their standard ops.

My understanding is that tf.py_func is still not exported into the actual GraphDef and is not available outside of python, so the ModelServer can't use it.

AfrazHussain commented 6 years ago

I am facing a similar issue where I want to preprocess a string (or a batch of strings, during training) and transform it to a vector using TensorFlow's VocabularyProcessor. That's pretty much the only transformation that I require. I have had a chance to look at the tf.transform ops but couldn't find any relevant ones, so I'm definitely missing something here.

It would be great if I could get some pointers here.
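The closest in-graph equivalent I've pieced together looks something like this (TF 1.x sketch; the vocabulary file and MAX_LEN are placeholders):

```python
import tensorflow as tf

MAX_LEN = 20  # assumed fixed document length, as with VocabularyProcessor

# assumed vocabulary file with one token per line
table = tf.contrib.lookup.index_table_from_file('vocab.txt', num_oov_buckets=1)
text = tf.placeholder(tf.string, [None])
tokens = tf.string_split(text)  # whitespace tokenization
ids = tf.sparse_tensor_to_dense(table.lookup(tokens), default_value=0)
# pad, then truncate, to a fixed length
ids = tf.pad(ids, [[0, 0], [0, MAX_LEN]])[:, :MAX_LEN]
```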

zeekvfu commented 6 years ago

Facing a similar issue too. I need to do lots of text preprocessing (lowercasing, a bunch of regex operations, feature encoding...). If TensorFlow Serving doesn't have elegant support for such basic needs, I have to say it's really only pseudo-serving.

wjarek-zz commented 6 years ago

TensorFlow Transform (https://github.com/tensorflow/transform) is likely a better fit for this use case. More info and an example: https://github.com/tensorflow/transform/blob/master/getting_started.md. At a high level, TensorFlow Transform produces a preprocessing graph which you can use at training time, as well as include as part of the serving graph (export) in order to avoid training-serving skew. If TensorFlow Transform (TFT) doesn't fit your use case, could you please file an issue against TFT with more details so the TFT team can take a look?
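A minimal sketch of what that looks like (the feature name 'text' and the vocabulary size here are just assumptions):

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # analyzed over the full dataset at training time, then exported as
    # part of the serving graph so training and serving stay in sync
    tokens = tf.string_split(inputs['text'])
    token_ids = tft.compute_and_apply_vocabulary(tokens, top_k=10000)
    return {'token_ids': token_ids}
```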

zeekvfu commented 6 years ago

@wjarek Thanks for your reply. I've taken a brief look at TensorFlow Transform, and found this, and I don't think TFT can satisfy my needs.

Does TFT support lowercasing or uppercasing strings? Does TFT support regex? NO and NO, let alone other kinds of strange or rare operations. When we say preprocessing or ETL, we mean that users can do all kinds of stuff (even some strange operations). Users can do preprocessing quite well via the native Python API (including third-party modules) or Spark (if preprocessing requires a full pass over the dataset, or the original corpus is really large), but probably can't do it with tf.Transform. Sounds ironic?

I do think TensorFlow needs a module like tf.Transform to solve the training-serving skew issue and to enable preprocessing which requires a full pass over the dataset. But maybe TFT isn't doing it in an elegant manner? Even if you implement lowercasing/uppercasing and regex, will you implement all the other operations that TensorFlow users may request? To be accurate, it's re-implementation, since the native Python API (including third-party modules) has already done all this, and TensorFlow users are more familiar with it. And after you implement all of this, will users actually use it? Preprocessing is important, but it is not the core of TensorFlow model training.

I wrote quite a lot, and I hope I'm wrong. :-)

wjarek-zz commented 6 years ago

This is good feedback. Adding @kestertong to chime in on this thread.

KesterTong commented 6 years ago

Regarding tf.Transform (TFT), we do track the state of TF ops (in the tensorflow repo) for text processing, and we do provide some examples of using TF ops for text processing with tf.Transform (see the text processing examples). But tf.Transform does not provide any ops that don't exist in TF (or that users don't write themselves).

We have some helper functions (e.g. for n-grams) but these are built on top of TF ops in the tensorflow repo.
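For example, the n-grams helper just composes plain TF string ops; a quick sketch of its use (nothing here requires a custom kernel):

```python
import tensorflow as tf
import tensorflow_transform as tft

tokens = tf.string_split(tf.constant(['the quick brown fox']))
bigrams = tft.ngrams(tokens, ngram_range=(2, 2), separator=' ')
```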

I hope that clarifies the role of tf.Transform when it comes to processing.

Some more information regarding other comments in the thread:

KesterTong commented 6 years ago

@zeekvfu You've asked about the design choice of tf.Transform to only support transformations implemented within a tf.Graph, as opposed to supporting arbitrary Python.

This is indeed a significant choice, and there were many factors going into this choice. Probably the most significant factor is that TensorFlow is growing fast, and we expect the scope of things that can be done in TF ops to grow over time. Even if the precise operations a user has in mind may not exist as TF ops, there may be better solutions for the given problem available via TF ops. E.g. a user might be able to replace custom Python logic for image transformations with a pre-trained deep model.

ruanchong commented 6 years ago

Several preprocessing steps in NLP:

I'm wondering if these features can be supported in the future.

sathyarr commented 5 years ago

I have exported a SavedModel successfully and can run it with TF Serving. But I still cannot serve, because I cannot hook up the preprocessing step.

One of my preprocessing use cases: I need to load the source and target vocabulary from an external file into a hash_table in the graph. The preprocessing is exactly as mentioned here in the seq2seq library.

Any help?

guillaumekln commented 5 years ago

Did you read this issue https://github.com/tensorflow/serving/issues/770?

sathyarr commented 5 years ago

Did you read this issue #770?

Thank you for pointing that out. I still have a few queries.

I can see that it seems to be the recommended way of doing this. Are you suggesting I change these lookup lines to tf.contrib.lookup.index_table_from_file or tf.contrib.lookup.index_to_string_table_from_file?

As per the current implementation of seq2seq, the nodes (hash tables) holding the vocabulary details are created in the saved graph. But there may not be any <key, value> pairs in those hash tables when serving with TF Serving (I believe this is what's happening now!).

In case we modify the code to use tf.contrib.lookup.index_table_from_file or tf.contrib.lookup.index_to_string_table_from_file, what will happen in the saved graph? At which point does TensorFlow Serving populate the required lookup tables?
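For reference, a TF 1.x export sketch (the vocabulary path and export directory are placeholders): the tables are populated at model-load time, because ModelServer runs the SavedModel's main_op when it loads the model, before serving any request.

```python
import tensorflow as tf

vocab_table = tf.contrib.lookup.index_table_from_file(
    'source_vocab.txt', num_oov_buckets=1)  # assumed vocab file
text = tf.placeholder(tf.string, [None], name='text')
ids = vocab_table.lookup(text)
dummy = tf.Variable(0)  # the builder needs at least one variable to save

builder = tf.saved_model.builder.SavedModelBuilder('/tmp/export/1')
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        # main_op runs when the model is loaded, filling the hash table
        # from the vocabulary file before the first request arrives
        main_op=tf.tables_initializer())
builder.save()
```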

ParthTandel commented 5 years ago

Are there any updates on this?

ParthTandel commented 5 years ago

I wanted to use TensorFlow Serving to create a RESTful service where I pass a bunch of text inputs to the REST API, use sklearn functions like TF-IDF vectorizers to preprocess the text, and then use sklearn's cosine similarity function for some other logic on top of it.

hankcs commented 4 years ago

I've read many issues and articles. Although the official maintainers suggest using TF ops for preprocessing, practitioners find that inconvenient or simply impossible. Take the famous BERT model as an example: its tokenizer is not something tf.text can handle. As a result, every project I see does its preprocessing in Flask and wraps the TensorFlow server with a Flask server. I feel really bad about it; it looks like putting a railway gun on a donkey.
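That wrapper pattern usually looks something like this (a sketch; the model name, port, and tokenize() are placeholders for a project's real ones):

```python
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

def tokenize(text):
    # stand-in for the project's real Python tokenizer (e.g. BERT's)
    return text.lower().split()

@app.route('/predict', methods=['POST'])
def predict():
    tokens = tokenize(request.json['text'])
    # forward the preprocessed input to the TF Serving REST endpoint
    resp = requests.post(
        'http://localhost:8501/v1/models/my_model:predict',  # assumed
        json={'instances': [tokens]})
    return jsonify(resp.json())
```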

liiitleboy commented 3 years ago

This is a good suggestion. For example, we could call it internally through a pipeline.