sql-machine-learning / sqlflow

Brings SQL and AI together.
https://sqlflow.org
Apache License 2.0
5.09k stars 699 forks source link

Support TO RUN on SQLFlow #2161

Open brightcoder01 opened 4 years ago

brightcoder01 commented 4 years ago

SQLFlow describes an end-to-end machine learning pipeline. Data transformation is an important part in the entire process.

Please check the following example SQL statement:

SELECT * FROM {source_table}
TO RUN {function_name}
WITH
    param_a = value_a,
    param_b = value_b
INTO {result_table}

{function_name} is the name of data transformation function. It can be either a built-in function from SQLFlow or the customized function provided by the users. We will support built-in function at the first step. TSFresh is our first built-in function.

{source_table} is the name of the input table from which the transform function above read the data. {result_table} is the name of the output table into which the transform function above will write the processed result.

The design doc

link.

Task break down

brightcoder01 commented 4 years ago
SQLFlow Built-in TSFresh High-Level Python Api
def add_extracted_ts_features(
    iterator,
    column_id,
    column_time,
    column_values,
    windows,
    extract_setting):
    """
    Extract the features from the input containing the time series
    data and then append the extract features into the source input.
    Arguments:
        iterator:
            The iterator for the input data.
        id_column:
            str. The name of the id column to group by.
        time_column:
            str. The name of the time column.
        value_columns:
            List of str. The name of the columns for the time series data.
        windows:
            List of int. The sliding window sizes with which we will 
            try to roll original the data.
        extract_setting:
            str. The feature extraction setting. It's one of the values in
            ['Comprehensive', 'Efficient', 'IndexBased',
             'Minimal', 'TimeBased']
    Returns:
        A pandas.DataFrame. It contains both the original column from
        the input iterator and the extracted feature columns from the time series
        columns.
    """
    pass
Use this API in SQL expression
SELECT * FROM source_table
TO RUN add_extracted_ts_features
WITH
    id_column = id,
    time_column = record_date,
    value_columns = ['pv', 'uv'],
    windows = [1, 5, 10],
    extract_setting = Minimal
INTO result_table