[Discussion] How to generate time series features using tsfresh in SQLFlow

sql-machine-learning / sqlflow

Brings SQL and AI together.

https://sqlflow.org

Apache License 2.0

5.07k stars 699 forks source link

[Discussion] How to generate time series features using tsfresh in SQLFlow #2137

Open Yancey1989 opened 4 years ago

Yancey1989 commented 4 years ago

tsfresh calculates a large number of time series characteristics automatically, which wildly used in time series modeling, we have two options to integrate it in SQLFlow.

Option 1

SELECT * FROM ... 
TO TRAIN Regression 
COLUMN TSFRESH(t, v1), TSFRESH(t, v2) 
INTO my_model;

This Option, COLUMN CLAUSE generate step by step Python function calls and then train Regression models:

dataset = tsfresh(dataset, t, v1)
dataset = tsfresh(dataset, t, v2)
train(dataset)

Con:

users can only use ONE statement to deal with preprocessing and training, it's corresponded with COLUMN design.

Option 2

SELECT * FROM ... TO RUN TSFRESH(t, v1,v2) INTO t1;
SELECT * FROM t1 TO TRAIN Regression INTO my_model;

Con:

tsfresh generates some additional columns, if we want to explain the trained model, it's more meaningful for users.

shendiaomo commented 4 years ago

I prefer Option 1.

Users don't have to know whether TSFRESH would generate a new table.
That's why we design the COLUMN clause: to describe the feature engineering.

tsfresh generates some additional columns, if we want to explain the trained model, it's more meaningful for users.

Can we document the tsfresh feature to help users understand this behavior? After all, we've already shown crossed features to them in the decision plot, even if they didn't specify any feature combinations.

brightcoder01 commented 4 years ago

Some background

Input data format of tsfresh: link Time series forecasting using tsfresh: link

A simple case for time series forecasting

predict(previous_hours, pv_of_previous_hours) -> pv_of_this_hour

t is short for current_hour, l is the length of time window.

predict(t-1, pv(t-1), t-2, pv(t-2), ... , t-l, pv(t-l)) -> pv(t)

This is a regression problem, we can train it using XGBoost or others.

Features directly from the source table: t, pv(t-1), pv(t-2), ...., pv(t-l)
Label: pv(t)

tsfresh will run on the time series data pv(t), ...., pv(t-l) to get more features: simple statistic values such as mean, variance, autocorrelatation, count_above_mean .etc; complex values such as T_x__fft_coefficient__coeff_0__attr_"abs" and so on.

And then we can get the data containing the following columns:

t, pv(t-1), pv(t-2), ... pv(t-l), derived_feature_1, derived_feature_2, ..., derived_feature_N

No we can feed the features above and label into XGBoost to train a model.

typhoonzero commented 4 years ago

Users don't have to know whether TSFRESH would generate a new table.

If we need to explain the model, the explain result will have features that the user won't even know. If we let the COLUMN clause to output a table storing automatically generated features by tsfresh, the user still have to specify a table name.

That's why we design the COLUMN clause: to describe the feature engineering.

After all, we'd like the user don't need to use the COLUMN clause when writing a TO TRAIN statement, the columns are naturally derived from the SELECT statement.

shendiaomo commented 4 years ago

Users don't have to know whether TSFRESH would generate a new table.

If we need to explain the model, the explain result will have features that the user won't even know. If we let the COLUMN clause to output a table storing automatically generated features by tsfresh, the user still have to specify a table name.

That's why we design the COLUMN clause: to describe the feature engineering.

After all, we'd like the user don't need to use the COLUMN clause when writing a TO TRAIN statement, the columns are naturally derived from the SELECT statement.

I still believe the COLUMN clause should be the solution. For example, we may need to specify several different windows, with COLUMN we can write:

SELECT * FROM my_ts_table TO TRAIN xgboost.gbtree WITH objective='reg:squarederror' 
    COLUMN 
       TSFRESH(t, "v1|v2", 7), 
       TSFRESH(t, "v1|v2", 30), 
       TSFRESH(t, v1, 180)
INTO my_ts_xgb_model;

Without the COLUMN clause, the statements will be tedious and error-prone:

SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",7) INTO t7;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",30) INTO t30;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, v1, 180) INTO t_v1_180;
SELECT * FROM 
    (SELECT * FROM 
        (SELECT * from my_ts_table JOIN t7 ON my_ts_table.t = t7.t) x
     JOIN t30 ON x.t = t30.t) y
JOIN t_v1_180 ON y.t = t_v1_180.t
TO TRAIN xgboost.gbtree WITH objective='reg:squarederror' INTO my_ts_xgb_model;

It will be a nightmare for users to maintain theses statements if they want to add or remove calls to TSFRESH later.

brightcoder01 commented 4 years ago

We can combine the following three expression into one:

SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",7) INTO t7;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",30) INTO t30;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, v1, 180) INTO t_v1_180;

SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2", [7, 30, 180]) INTO table_with_derived_feature;

For XGBoost explain, we need the data table containing both the original features from source table and the derived features. So we need the specific table name table_with_derived_feature to execute the explain SQL just as follows:

SELECT * FROM table_with_derived_feature TO EXPLAIN my_model

So it would be more user-friendly to put TSFRESH to TO RUN clause.

Yancey1989 commented 4 years ago

I prefer the TO RUN clause.COLUMN TSFRESH would generate some additional columns, that may make confusing to users. TO RUN execute a Python function call which input is a table (SELECT ...) and output is a table.

SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2", [7, 30, 180]) INTO table_with_derived_feature;

In addition, users can publish the Python function definition in Model zoo as a Docker image, the above TO RUN TSFRESH clause would be like:

SELECT * FROM my_ts_table
TO RUN my-registry/yanxu/tsfresh:latest/run WITH
    ts_col='t',
    value_col='v1,v2',
    windows=[7, 30, 180]
INTO table_with_derived_feature;

brightcoder01 commented 4 years ago

COLUMN VS TO RUN

The model definition in model zoo is not a complete model. If we want to make the model complete based on the schema of source table. We will choose COLUMN clause. The code generated from COLUMN clause runs together with each model training iteration. COLUMN clause is an attribute of TO TRAIN, it describes/decorates how we convert the source data (SELECT * FROM) into the model definition in model zoo for each data instance.
If we want to transform the data before the model training process instead of making the model definition complete. We have already complete the transformation on the entire source table and get the result table before executing training. We can use SQL + UDF or use TO RUN clause - Such as TSFRESH.

shendiaomo commented 4 years ago

I prefer the TO RUN clause.COLUMN TSFRESH would generate some additional columns, that may make confusing to users. TO RUN execute a Python function call which input is a table (SELECT ...) and output is a table.

SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2", [7, 30, 180]) INTO table_with_derived_feature;

In addition, users can publish the Python function definition in Model zoo as a Docker image, the above TO RUN TSFRESH clause would be like:
SELECT * FROM my_ts_table
TO RUN my-registry/yanxu/tsfresh:latest/run WITH
    ts_col='t',
    value_col='v1,v2',
    windows=[7, 30, 180]
INTO table_with_derived_feature;

In fact, my original example is to generate [7, 30] for v2 and [7, 30, 180] for v1, it seems the WITH clause cannot avoid JOIN for this?

shendiaomo commented 4 years ago

We can combine the following three expression into one:
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",7) INTO t7;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",30) INTO t30;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, v1, 180) INTO t_v1_180;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2", [7, 30, 180]) INTO table_with_derived_feature;
For XGBoost explain, we need the data table containing both the original features from source table and the derived features. So we need the specific table name table_with_derived_feature to execute the explain SQL just as follows:
SELECT * FROM table_with_derived_feature TO EXPLAIN my_model
So it would be more user-friendly to put TSFRESH to TO RUN clause.

About how to avoid too complicated `JOIN` in `TO RUN`, there's another point from @brightcoder01

Use a pipe-like mechanism to stack the TRANSFORM statements

SELECT * FROM table1 TO TRANSFORM py_func1;  -- add some columns to table1
SELECT * FROM table1 TO TRANSFORM py_func2;  -- add some columns to table2
...

About the semantics of the `TO RUN` syntax, from @shendiaomo

approved by @Yancey1989 @brightcoder01

Rename TO RUN to TO TRANSFORM to give explicit semantics to the new syntax: data transforming.
Think TO TRANSFORM as a way to express calling a SQLFlow UDF
- Pitfalls: TO TRANSFORM cannot be nested in other SELECT statements, this is different from UDFs in standard SQL

About the semantics of the `COLUMN` syntax, from @brightcoder01 @shendiaomo

COLUMN is something that has to be bundled to the saved model, like tf.SavedModel
It's not the right time for us to derive an exact design of this problem for XGBoost at the moment.

brightcoder01 commented 4 years ago

Add more comments about COLUMN semantics

The logic in COLUMN clause will be executed per data instance during model training/prediction.
It's a part of the complete model definition.
It will be exported to a serialized format together with the main model in model zoo. It can be used for both batch prediction and online serving.

sql-machine-learning / sqlflow

[Discussion] How to generate time series features using tsfresh in SQLFlow #2137

Option 1

Option 2

Some background

A simple case for time series forecasting

About how to avoid too complicated JOIN in TO RUN, there's another point from @brightcoder01

About the semantics of the TO RUN syntax, from @shendiaomo

About the semantics of the COLUMN syntax, from @brightcoder01 @shendiaomo

About how to avoid too complicated `JOIN` in `TO RUN`, there's another point from @brightcoder01

About the semantics of the `TO RUN` syntax, from @shendiaomo

About the semantics of the `COLUMN` syntax, from @brightcoder01 @shendiaomo