sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0

Confirm the overall design #245

Closed wangkuiyi closed 4 years ago

wangkuiyi commented 5 years ago

SQLFlow: Code Generation

SQLFlow needs to generate a training program given a SQL statement in the extended syntax. We are looking for a design that covers SQLFlow, SQL engines, and the AI engine. This document explains what is currently on my mind. For simplicity, and without loss of generality, let us assume the SQL engine is Alibaba ODPS.

The Problem

The starting point of this document is SQL programmers writing SQL statements; the end point is SQLFlow generating a runnable training program in Python, which consists of three parts:

  1. the model, which resides in a model base, possibly a Git repo contributed by the modeling team, where each model is a Python class derived from tf.keras.Model (see the sketch after this list),
  2. the data-to-feature conversion, which SQLFlow is supposed to generate automatically, since the extended SQL syntax allows users to hint the generation, and
  3. the backbone of the training program, namely the AI engine, a framework that can train a tf.keras.Model-derived class using a cluster of computers.
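
For item 1, a minimal sketch of what a model-base entry could look like; this is an illustrative assumption, not code from the actual model base:

import tensorflow as tf

class LogisticRegression(tf.keras.Model):
    """A hypothetical model-base entry derived from tf.keras.Model."""

    def __init__(self, feature_columns):
        super().__init__()
        # The first layer converts a dict-shaped minibatch into a dense tensor.
        self.feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
        self.output_layer = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        return self.output_layer(self.feature_layer(inputs))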

The Input

The input of the above code generation work is SQL statements like

SELECT * FROM employees
TRAIN LogisticRegression
COLUMN *, cross(name, age)
LABEL salary
INTO a_simple_model;

or

SELECT * FROM students
TRAIN LogisticRegression
COLUMN *, cross(name, home_address)
LABEL year_end_score
INTO another_model;

The above examples illustrate that SQLFlow users can hint the following information:

  1. the name of the tf.keras.Model-derived class, i.e., LogisticRegression in the above examples,
  2. the temporary table that saves the result of the standard SQL part, like SELECT * FROM employees/students, and
  3. the columns, or field-to-feature mapping, like COLUMN *, cross(name, home_address).
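
Putting the three hints together, the parser might hand the code generator something like the following; the field names are hypothetical, for illustration only:

# Hypothetical parse result of the first example statement.
parsed = {
    "standard_select": "SELECT * FROM employees",
    "model": "LogisticRegression",
    "columns": ["*", "cross(name, age)"],
    "label": "salary",
    "into": "a_simple_model",
}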

The Model

The model, or the tf.keras.Model-derived class, takes a minibatch of rows from the result table as its input. Because the minibatch comes from a SQL engine, it is intrinsically structured data, which can be represented by a Python dictionary. For example, a minibatch of four rows from the employees table might look like:

minibatch = {
    "name": ["Bob", "Ted", "Ray", "Ed"],
    "age": [22, 32, 40, 56],
    "salary": [100, 200, 300, 400],
}

The first layer of the model must know how to convert the minibatch into a dense tensor input. In TensorFlow 2.0, the newly added tf.keras.layers.DenseFeatures layer can do this conversion. Following an example in the official tutorial, we can define the model like the following:

import tensorflow as tf
from tensorflow.keras import layers

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])

where the feature_columns parameter of feature_layer is a Python list of feature column API calls, each of which corresponds to a feature, for example:

feature_columns = [
    # A categorical column must be wrapped, e.g., in an indicator_column,
    # before a DenseFeatures layer can consume it.
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            "name", hash_bucket_size=100)),
    tf.feature_column.numeric_column("age"),
]

It is the modeling team's work to define the model whose first layer is tf.keras.layers.DenseFeatures; it is the SQLFlow code generator's work to create the feature_columns list.

Question 1: It seems that TensorFlow versions earlier than 2.0.0 alpha don't have tf.keras.layers.DenseFeatures. What is the mechanism to convert data into features then?

The Data

The AI engine reads data from the result table in minibatches. With the TensorFlow graph mode, we need to wrap each data source, for example, MySQL or Alibaba ODPS, into a TensorFlow dataset operator. With Eager Execution, or the eager mode, the training program can call the data access API directly.
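
A minimal sketch of the graph-mode approach, assuming all we have is a plain Python reader for the SQL engine and we wrap it with tf.data.Dataset.from_generator; open_reader here stands in for whatever reader API the engine provides and is an assumption, not a real call:

import tensorflow as tf

def read_result_table():
    # Hypothetical generator: yields one (features, label) pair per row.
    for record in open_reader(result_table):  # assumed SQL-engine reader API
        yield {"name": record["name"], "age": record["age"]}, record["salary"]

dataset = tf.data.Dataset.from_generator(
    read_result_table,
    output_types=({"name": tf.string, "age": tf.int64}, tf.int64),
).batch(32)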

Suppose SQLFlow works with Alibaba ODPS; the AI engine can call ODPS's reader API:

r = odps.models.partition.open_reader(result_table)
count = r.count
for record in r[0: count]:
    d = convert_record_into_python_dict(record)
    print(model(d))  # call the model's forward pass

If we want to shard the input, say, a worker of the AI engine wants to read only the rows from 1000 to 2000, we change the line:

for record in r[0: count]:

into

for record in r[1000: 2000]:
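
More generally, each worker could compute its own slice from its index and the total number of workers; a sketch under that assumption, where worker_index and num_workers are hypothetical values supplied by the AI engine:

# Hypothetical sharding: evenly split [0, count) across num_workers workers.
shard_size = (count + num_workers - 1) // num_workers
begin = worker_index * shard_size
end = min(begin + shard_size, count)
for record in r[begin: end]:
    d = convert_record_into_python_dict(record)
    print(model(d))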

Question 2: In TensorFlow graph mode, it seems that the only way we can read from ODPS is via the ODPSDataset operator, and the only way to shard is by calling shard. Does TensorFlow's shard function really work with ODPS for sharding, and how efficient is it when working with ODPS?

The Metadata

When SQLFlow generates the calls to tf.feature_column.* functions, it needs to provide parameters like the vocabulary and the number of hash buckets. To decide which function to call, and to decide the parameters, SQLFlow needs to scan over the result table. Let us take some examples.

In short, SQLFlow needs a set of heuristic rules that consider both the SQL data types of the fields and the real data. There have been discussions about saving the scanning results, or statistics of the data, into a metadata table; that looks unnecessary to me if SQLFlow generates calls to tf.feature_column.* functions directly.
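
A sketch of what such a heuristic rule could look like, assuming the scan gives us each field's SQL type and, for string fields, an approximate number of distinct values; the function and parameter names are illustrative, not part of SQLFlow:

import tensorflow as tf

def field_to_feature_column(name, sql_type, distinct_count):
    # Hypothetical heuristic mapping a result-table field to a feature column.
    if sql_type in ("INT", "BIGINT", "FLOAT", "DOUBLE"):
        return tf.feature_column.numeric_column(name)
    # String fields become categorical; the bucket size comes from the scan.
    categorical = tf.feature_column.categorical_column_with_hash_bucket(
        name, hash_bucket_size=max(distinct_count, 100))
    # Wrap it so that a DenseFeatures layer can consume it.
    return tf.feature_column.indicator_column(categorical)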

The AI Engine

Given the above discussion, we see that the API of the AI engine could be as simple as

ai.train(
    model_class = sqlflow.models.LogisticRegression,  # `class` is a reserved word in Python
    feature_columns = feature_columns,
    data_source = sqlflow.ODPS,
    table_name = result_table)
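
Internally, ai.train could be little more than a Keras compile-and-fit loop; a single-machine sketch under that assumption, not the actual engine implementation:

import tensorflow as tf

def train(model_class, feature_columns, dataset, epochs=1):
    # Hypothetical single-machine version of the AI engine's train API.
    model = model_class(feature_columns)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit(dataset, epochs=epochs)
    return model
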
tonyyang-svail commented 5 years ago

It seems that TensorFlow versions earlier than 2.0.0 alpha don't have tf.keras.layers.DenseFeatures. What is the mechanism to convert data into features then?

You can use tf.feature_column.input_layer.

Update: @wangkuiyi tf.feature_column.input_layer is part of the tf.layers API, which is being deprecated.
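
For reference, a minimal TensorFlow 1.x sketch using tf.feature_column.input_layer; the values are made up for illustration:

import tensorflow as tf  # TensorFlow 1.x

features = {"age": tf.constant([[22], [32], [40], [56]], dtype=tf.float32)}
feature_columns = [tf.feature_column.numeric_column("age")]
# input_layer plays the role later taken by tf.keras.layers.DenseFeatures.
dense_input = tf.feature_column.input_layer(features, feature_columns)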

typhoonzero commented 4 years ago

Seems this issue is out of date; closing. Please refer to the latest implementation.