Support TO RUN on SQLFlow

sql-machine-learning / sqlflow

Brings SQL and AI together.

Apache License 2.0

5.09k stars 699 forks source link

SQLFlow describes an end-to-end machine learning pipeline. Data transformation is an important part in the entire process.

For the per data instance transformation, we will use COLUMN clause.
For the non per data instance transformation, we propose TO RUN clause to support this functionality. Please refer to the discussion #2137

Please check the following example SQL statement:

SELECT * FROM {source_table}
TO RUN {function_name}
WITH
    param_a = value_a,
    param_b = value_b
INTO {result_table}

{function_name} is the name of data transformation function. It can be either a built-in function from SQLFlow or the customized function provided by the users. We will support built-in function at the first step. TSFresh is our first built-in function.

{source_table} is the name of the input table from which the transform function above read the data. {result_table} is the name of the output table into which the transform function above will write the processed result.

The design doc

link.

Task break down

[x] Upgrade parser to support TO RUN statement
[ ] Translate TO RUN to a workflow
[x] Upgrade goalisa to submit PyODPS task. Enable submitting ODPS SQL and PyODPS task on the deployment of Dataworks.
[ ] sqlflow.runner module.
[ ] TSFresh high level api implementation and docker image.
[ ] TO RUN function repo sample.

def add_extracted_ts_features( iterator, column_id, column_time, column_values, windows, extract_setting): """ Extract the features from the input containing the time series data and then append the extract features into the source input. Arguments: iterator: The iterator for the input data. id_column: str. The name of the id column to group by. time_column: str. The name of the time column. value_columns: List of str. The name of the columns for the time series data. windows: List of int. The sliding window sizes with which we will try to roll original the data. extract_setting: str. The feature extraction setting. It's one of the values in ['Comprehensive', 'Efficient', 'IndexBased', 'Minimal', 'TimeBased'] Returns: A pandas.DataFrame. It contains both the original column from the input iterator and the extracted feature columns from the time series columns. """ pass

SELECT * FROM source_table TO RUN add_extracted_ts_features WITH id_column = id, time_column = record_date, value_columns = ['pv', 'uv'], windows = [1, 5, 10], extract_setting = Minimal INTO result_table

sql-machine-learning / sqlflow

Support TO RUN on SQLFlow #2161

The design doc

Task break down

SQLFlow Built-in TSFresh High-Level Python Api

Use this API in SQL expression