It seems that earlier than 2.0.0 alpha, TensorFlow doesn't have `tf.keras.layers.DenseFeatures`. What's the mechanism to convert data into features then?
You can use `tf.feature_column.input_layer`.
Update: @wangkuiyi `tf.feature_column.input_layer` is part of the `tf.layers` API, which is being deprecated.
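For reference, a minimal TensorFlow 1.x sketch of the suggested workaround; the feature name `age` is an illustrative assumption:

```python
import tensorflow as tf  # TensorFlow 1.x

features = {'age': tf.constant([[25.0], [32.0]])}
feature_columns = [tf.feature_column.numeric_column('age')]

# input_layer converts the dict of feature tensors into one dense tensor,
# the role that tf.keras.layers.DenseFeatures plays in TensorFlow 2.0.
dense = tf.feature_column.input_layer(features, feature_columns)
```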
Seems this issue is out of date; closing. Please refer to the latest implementation.
SQLFlow: Code Generation
SQLFlow needs to generate a training program given a SQL statement of the extended syntax. We are looking for a design that covers SQLFlow, SQL engines, and the AI engine. This document explains what I currently have in mind. For simplicity and without loss of generality, let us assume the SQL engine is Alibaba ODPS.
The Problem
The starting point of this document is that some SQL programmers write SQL statements, and the end point is that SQLFlow generates a runnable training program in Python, which includes three parts:

- the definition of the model as a `tf.keras.Model`-derived class,
- the code that reads minibatches from the result table, and
- the training loop that trains the `tf.keras.Model`-derived class using a cluster of computers.

The Input
The input of the above code generation work is SQL statements of the extended syntax, like the two sketched below.
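A sketch reconstructing the two examples from the fragments cited in the following list; the TRAIN and LABEL clauses and the label fields (`salary`, `score`) are assumptions about the extended syntax:

```sql
SELECT * FROM employees
TRAIN LogisticRegression
COLUMN *, cross(name, home_address)
LABEL salary;  -- the label field is hypothetical
```

or

```sql
SELECT * FROM students
TRAIN LogisticRegression
COLUMN *, cross(name, home_address)
LABEL score;  -- the label field is hypothetical
```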
The above examples illustrate that SQLFlow users can hint the following information:

- the model, a `tf.keras.Model`-derived class, or `LogisticRegression` in the above example,
- the training data, `SELECT * FROM employees/students`, and
- the features, `COLUMN *, cross(name, home_address)`.

The Model
The model, or the `tf.keras.Model`-derived class, takes a minibatch of rows from the result table as its input. Because the minibatch comes from a SQL engine, it is intrinsically structured data, which can be represented by a Python dictionary. For example, a minibatch of five rows from the `employee` table might look like the following.
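A sketch with made-up values; the field names (`name`, `age`, `gender`, `home_address`) are taken from the features this document mentions elsewhere:

```python
# One minibatch: a dict mapping each field name to a list of five values,
# one per row of the result table.
minibatch = {
    'name':         ['alice', 'bob', 'carol', 'dave', 'eve'],
    'age':          [28, 35, 41, 23, 52],
    'gender':       ['female', 'male', 'female', 'male', 'unknown'],
    'home_address': ['addr1', 'addr2', 'addr3', 'addr4', 'addr5'],
}
```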
The first layer of the model must know how to convert the minibatch into a dense tensor input. In TensorFlow 2.0, the newly added `tf.keras.layers.DenseFeatures` can do this conversion. In an example given in the official tutorial, we can define the model like the following.
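A sketch following the pattern of the official TensorFlow 2.0 structured-data tutorial; the hidden layer sizes are illustrative, and `feature_columns` is the list defined in the next snippet:

```python
import tensorflow as tf

# DenseFeatures converts the dict minibatch into one dense tensor.
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```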
where the `feature_columns` parameter of `feature_layer` is a Python list of feature column API calls, each of which corresponds to a feature, for example:
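A sketch assuming the `age` and `gender` fields from the minibatch example above:

```python
feature_columns = [
    tf.feature_column.numeric_column('age'),
    # Categorical columns must be wrapped, e.g., in an indicator_column,
    # before DenseFeatures can densify them.
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            'gender', ['male', 'female', 'unknown'])),
]
```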
It is the modeling team's job to define a model whose first layer is `tf.keras.layers.DenseFeatures`; it is the SQLFlow code generator's job to create the feature columns list.

Question 1: It seems that earlier than 2.0.0 alpha, TensorFlow doesn't have `tf.keras.layers.DenseFeatures`. What's the mechanism to convert data into features then?

The Data
The AI engine reads data into minibatches from the result table. With the TensorFlow graph mode, we need to wrap each data source, for example MySQL or Alibaba ODPS, into a TensorFlow Dataset operator. With Eager Execution (the eager mode), the training program can call the data access API directly.
Suppose that SQLFlow is working with Alibaba ODPS; the AI engine can call ODPS's reader API, as in the sketch below.
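A sketch using the PyODPS client; the credentials and table name are placeholders:

```python
from odps import ODPS

# Placeholder credentials; real values come from the deployment environment.
o = ODPS('<access-id>', '<access-key>', project='<project>',
         endpoint='<endpoint>')

table = o.get_table('employees')
with table.open_reader() as reader:
    for record in reader:
        pass  # convert `record` into a row of the next minibatch
```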
If we want to shard the input, say, a worker of the AI engine wants to read only the rows from 1000 to 2000, we change the iteration over the reader as shown below.
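A sketch of the change, assuming the PyODPS reader from the previous snippet (PyODPS table readers support slicing):

```python
# Before: every worker iterates over all rows.
for record in reader:
    pass

# After: this worker reads only rows 1000 to 2000.
for record in reader[1000:2000]:
    pass
```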
Question 2: In the TensorFlow graph mode, it seems that the only way we can read from ODPS is via the ODPSDataset operator, and the only way to shard is by calling `shard`. Does TensorFlow's `shard` function actually work with ODPS for sharding, and if so, how efficient is it? The sketch below illustrates the concern.
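A sketch of the graph-mode path; `ODPSDataset` is the hypothetical custom Dataset operator mentioned above, and the worker configuration is assumed:

```python
num_workers, worker_index = 4, 0  # assumed cluster configuration

# Hypothetical custom tf.data operator wrapping an ODPS table.
dataset = ODPSDataset('SELECT * FROM employees')

# Dataset.shard keeps only every num_shards-th element downstream, so each
# worker still pulls every row from ODPS before discarding most of them --
# the efficiency concern behind Question 2.
dataset = dataset.shard(num_shards=num_workers, index=worker_index)
dataset = dataset.batch(32)
```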
The Metadata
When SQLFlow generates the calls to `tf.feature_column.*` functions, it needs to provide parameters like the vocabulary and the number of hash buckets. To decide which function to call and with which parameters, SQLFlow needs to scan over the result table. Let us take some examples.

Suppose there is a field `gender` that takes one of three values: `male`, `female`, and `unknown`. By scanning over the result table, or a sufficient number of its rows, SQLFlow should be able to identify the vocabulary of three values and decide to call `tf.feature_column.categorical_column_with_vocabulary_list`.

Another example is `name`, whose vocabulary is too big to enumerate; the scan should suggest calling `tf.feature_column.categorical_column_with_hash_bucket`.

A third example is the call to `cross(name, home_address)` in the first SQL snippet of this document: the user states clearly that s/he wants a new feature created by a call to `tf.feature_column.crossed_column`. The sketch below puts the three generated calls together.
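Putting the three examples together, the generated calls might look like the following; the hash bucket sizes are illustrative guesses:

```python
import tensorflow as tf

# Vocabulary identified by scanning the result table: three distinct values.
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', vocabulary_list=['male', 'female', 'unknown'])

# Too many distinct names to enumerate, so hash them into buckets.
name = tf.feature_column.categorical_column_with_hash_bucket(
    'name', hash_bucket_size=10000)

# The user explicitly asked for cross(name, home_address).
name_x_address = tf.feature_column.crossed_column(
    ['name', 'home_address'], hash_bucket_size=10000)
```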
In short, SQLFlow needs a set of heuristic rules that consider both the SQL data types of the fields and the actual data. There were discussions about saving the scanning results, or statistics of the data, into a metadata table; this looks unnecessary to me if SQLFlow generates calls to `tf.feature_column.*` functions.

The AI Engine
Given the above discussion, we see that the API of the AI engine could be as simple as a single training entry point, sketched below.
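A minimal sketch, assuming the engine is handed the model class, the generated feature columns, and a `tf.data.Dataset` over the result table; the name `train`, its signature, and the constructor convention `model_cls(feature_layer)` are assumptions, not an existing SQLFlow API:

```python
import tensorflow as tf

def train(model_cls, feature_columns, dataset, epochs=1):
    """Train a tf.keras.Model-derived class on a tf.data.Dataset of
    (features, label) minibatches read from the result table."""
    feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
    model = model_cls(feature_layer)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.fit(dataset, epochs=epochs)
    return model
```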