
RFP 001: Adding non-Transform feature engineering #610

Open rcrowe-google opened 5 years ago

rcrowe-google commented 5 years ago

TFX RFP 001 - Request for community comments and proposals

Adding Non-Transform Feature Engineering

An explanation of a specific critical user journey (CUJ) for windowing, and a general discussion of the need

robertcrowe@ 12 September 2019

Dataset and Modeling Goal

As part of creating training material for TFX we received a sample of data tracking usage of services by the number of unique customers. The data forms a discontinuous time series, with each example aggregating a 1-minute window. The data represents the demand for the services, which drives the need for staffing. The modeling goal is to predict the demand one hour from now on a frequent basis, with running inference every 10 minutes being a good target.

Model

The time series is discontinuous, which is a common problem for time series data. The gaps in the time series are a problem for sequence-based models for at least two reasons:

  1. Sequence models assume regularly spaced time steps, so gaps break the temporal structure the model is trying to learn.
  2. Sequences drawn across a gap create false adjacencies between measurements that are actually far apart in time.

There are at least two commonly used ways of dealing with the discontinuities:

  1. Impute values to fill the gaps, producing a continuous series.
  2. Window the data into fixed-length sequences and drop any window that spans a gap, so the model only sees contiguous data.

The second is the more general approach, and was the option chosen. A fairly generic model architecture was chosen and delivered good results:

import tensorflow as tf

# SEQUENCE, DROPOUT_RATE, LSTM_HIDDEN, CONV_FILTERS, CONV_KERNEL, and
# DENSE_HIDDEN are hyperparameters defined elsewhere.
def _build_model():
    # Each input is a window of SEQUENCE consecutive 1-minute demand values.
    inputs = tf.keras.Input(shape=(SEQUENCE, 1))

    # Branch 1: dropout followed by an LSTM over the sequence.
    x1 = tf.keras.layers.Dropout(rate=DROPOUT_RATE)(inputs)
    x1 = tf.keras.layers.LSTM(LSTM_HIDDEN,
                              kernel_initializer='glorot_uniform',
                              bias_initializer='zeros')(x1)

    # Branch 2: a 1D convolution over the same sequence, flattened.
    x2 = tf.keras.layers.Conv1D(filters=CONV_FILTERS,
                                kernel_size=CONV_KERNEL,
                                kernel_initializer='glorot_uniform',
                                bias_initializer='zeros')(inputs)
    x2 = tf.keras.layers.Flatten()(x2)

    # Merge the branches and regress a single demand value.
    x = tf.keras.layers.concatenate([x1, x2])
    x = tf.keras.layers.Dense(DENSE_HIDDEN,
                              kernel_initializer='glorot_uniform',
                              bias_initializer='zeros',
                              activation='relu')(x)
    predictions = tf.keras.layers.Dense(1,
                                        kernel_initializer='glorot_uniform',
                                        bias_initializer='zeros')(x)

    model = tf.keras.Model(inputs=inputs, outputs=predictions)
    return model
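
For completeness, a minimal sketch of how the model might be compiled and trained. The hyperparameter values and the synthetic arrays below are illustrative assumptions, not the settings or data from the original experiment:

import numpy as np

# Illustrative hyperparameter values (assumptions).
SEQUENCE = 60        # one hour of 1-minute aggregates per input window
DROPOUT_RATE = 0.2
LSTM_HIDDEN = 32
CONV_FILTERS = 16
CONV_KERNEL = 5
DENSE_HIDDEN = 32

model = _build_model()
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Synthetic stand-in data; real training windows come from the
# preprocessing discussed below.
train_x = np.random.rand(1000, SEQUENCE, 1).astype(np.float32)
train_y = np.random.rand(1000, 1).astype(np.float32)
model.fit(train_x, train_y, epochs=10, validation_split=0.1)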

Training - Options for Preprocessing

This kind of preprocessing appears to be beyond the capabilities of Transform. For scalability the data processing should use Beam, and Beam does support sliding window aggregations. That suggests at least three options:

  1. Preprocess the data using Beam into sliding windows before entering the TFX pipeline.
  2. Preprocess the data using Beam into sliding windows in a custom ExampleGen.
  3. Preprocess the data using Beam into sliding windows in a custom component downstream of the standard ExampleGen.

During development, a pure Python approach was used as a temporary way to proceed with model development while the Beam code was being written. That code aggregated the data into sliding windows before it entered the TFX pipeline. The Beam code was completed, but for the training session this was being developed for we ran out of time to integrate it into the pipeline, and there appeared to be performance problems, so we went with the pure Python approach for the training. It subsequently appears that the performance problems were probably caused by the DirectRunner, which has known performance limitations, and not by the Beam code itself.
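
For reference, a minimal sketch of what the Beam sliding-window aggregation (option 1 above) might look like. The element structure, step names, and window parameters are assumptions, and the write/output step is omitted:

import apache_beam as beam
from apache_beam.transforms import window

def run(events):
    # events: iterable of (unix_timestamp_seconds, unique_customer_count)
    # pairs, one per 1-minute aggregate.
    with beam.Pipeline() as p:  # DirectRunner by default; use a distributed
                                # runner such as DataflowRunner at scale.
        _ = (
            p
            | 'Create' >> beam.Create(events)
            | 'Timestamp' >> beam.Map(
                lambda e: window.TimestampedValue(e, e[0]))
            # One-hour windows sliding every 10 minutes, matching the
            # inference cadence described above.
            | 'Window' >> beam.WindowInto(
                window.SlidingWindows(size=60 * 60, period=10 * 60))
            | 'KeySingle' >> beam.Map(lambda e: (None, e))
            | 'Group' >> beam.GroupByKey()
            | 'ToSequence' >> beam.Map(
                lambda kv: [count for _, count in sorted(kv[1])])
        )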

Inference - Options for Preprocessing

It’s less clear how to integrate this preprocessing for inference, and there appears to be a significant potential for training/serving skew.

TFX Inference Pipeline

A TFX pipeline could be built that includes the same preprocessing component as option 2 or 3 above and ends in a component making client calls to a Serving instance. Assuming that pipeline latency would be acceptable (which is unverified), this approach would rely on source control and consistent deployment to avoid training/serving skew. While not ideal, this is probably the best approach currently available.
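
As an illustration of the client calls the final component would make, a request to a TensorFlow Serving REST endpoint might look like the following; the host, port, and model name are assumptions:

import json
import requests

# Hypothetical endpoint; TensorFlow Serving exposes its REST API on
# port 8501 by default.
SERVING_URL = 'http://localhost:8501/v1/models/demand:predict'

def predict_demand(sequence):
    # sequence: SEQUENCE floats, shaped to the model's (SEQUENCE, 1) input.
    payload = {'instances': [[[v] for v in sequence]]}
    response = requests.post(SERVING_URL, data=json.dumps(payload))
    response.raise_for_status()
    return response.json()['predictions'][0]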

Preprocessing Beam Pipeline

A Beam pipeline could be implemented to preprocess the data before delivering it to the Serving client. A separate instance of this same pipeline could be used to generate the training dataset by feeding training examples into it and archiving the output before sending it to a TFX training pipeline. Since the same pipeline could be used for both training and serving, this would avoid training/serving skew.
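One concrete way to get that property is to factor the windowing into a single function that both the training pipeline and the serving path import, so an example is windowed identically in both. A minimal pure-Python sketch, with illustrative names:

import numpy as np

def to_windows(timestamps, values, sequence_len, step=1):
    """Returns contiguous sliding windows of length sequence_len.

    Windows that span a gap in the 1-minute time series are dropped
    rather than imputed.
    """
    windows = []
    for start in range(0, len(values) - sequence_len + 1, step):
        ts = timestamps[start:start + sequence_len]
        # Keep the window only if its timestamps are consecutive minutes.
        if all(b - a == 60 for a, b in zip(ts, ts[1:])):
            windows.append(values[start:start + sequence_len])
    return np.asarray(windows, dtype=np.float32).reshape(-1, sequence_len, 1)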

General Need for Windowing

Sliding windows are a well-accepted technique for working with discontinuous time series data.

We've seen CUJs at Google which demonstrate a need for windowing, and we also have a current bug from gTech asking for windowing in TFX.

Overall Need for Non-Transform Processing

Transform is a very strong part of the TFX offering because of the advanced distributed data processing it enables and the integration of the transformations into the SavedModel, which prevents training/serving skew. However, since the transformations must be included in a SavedModel, they are limited to operations that can be implemented as TensorFlow ops. When transformations are required which cannot be implemented as TensorFlow ops, we currently have no design patterns, best practices, or methodologies for avoiding training/serving skew. This means that there are CUJs for which TFX implementations will include the potential for training/serving skew.
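
For contrast, a typical Transform preprocessing_fn is expressed entirely in TensorFlow ops and analyzers, which is exactly what lets it be embedded in the SavedModel; windowing logic like the above has no such TF-op expression. (The feature name here is illustrative.)

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # A full-pass analyzer plus TF ops: Transform can embed this in the
    # SavedModel, so training and serving apply it identically.
    return {
        'demand_scaled': tft.scale_to_z_score(inputs['demand']),
    }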

A Proposal

One possible way to address this problem is to create a “SavedPipeline” specification, which would pair the trained SavedModel with the non-TensorFlow preprocessing (for example, a Beam pipeline) so that the same preprocessing is deployed for both training and serving.

For TF.Serving-style deployments this might be implemented with a TFX architecture including components which run inference. For TF.Lite and TFJS-style deployments it is less clear how this would be implemented.


robertlugg commented 5 years ago

Could you add a brief definition of "CUJ" to your RFP? Am I summarizing the problem statement correctly as: "There are operations we need to apply to the dataset which cannot be expressed using tf.Transform. How might we represent such operations so that the same operations are performed both during training and during inference?"

rcrowe-google commented 5 years ago

CUJ == Critical user journey

Yes, that is essentially the problem statement. I might add "... during inference in TF Serving, and ideally also in TF Lite and TFJS deployments."

masip85 commented 4 years ago

Are there any updates on this, or is it being discussed in another thread?