Open Yancey1989 opened 4 years ago
I prefer Option 1.
TSFRESH would generate a new table.
COLUMN clause: to describe the feature engineering.
tsfresh generates some additional columns; if we want to explain the trained model, it's more meaningful for users.
Can we document the tsfresh feature to help users understand this behavior? After all, we've already shown crossed features to them in the decision plot, even if they didn't specify any feature combinations.
Input data format of tsfresh: link
Time series forecasting using tsfresh: link
predict(previous_hours, pv_of_previous_hours) -> pv_of_this_hour
t is short for current_hour, and l is the length of the time window.
predict(t-1, pv(t-1), t-2, pv(t-2), ... , t-l, pv(t-l)) -> pv(t)
This is a regression problem; we can train it using XGBoost or other models.
Features: t, pv(t-1), pv(t-2), ..., pv(t-l)
Label: pv(t)
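The windowed formulation above can be sketched in plain Python. This is only an illustration: the function name `make_windows` and the sample values are hypothetical and are not part of SQLFlow or tsfresh.

```python
from typing import List, Tuple


def make_windows(pv: List[float], l: int) -> List[Tuple[List[float], float]]:
    """Turn a series into (features, label) rows.

    For each t >= l the features are [t, pv(t-1), ..., pv(t-l)]
    and the label is pv(t).
    """
    if l < 1:
        raise ValueError("window length l must be at least 1")
    rows = []
    for t in range(l, len(pv)):
        lags = [pv[t - k] for k in range(1, l + 1)]  # pv(t-1) ... pv(t-l)
        rows.append(([float(t)] + lags, pv[t]))
    return rows


# Hypothetical hourly values, for illustration only.
pv = [10.0, 12.0, 11.0, 15.0, 14.0]
rows = make_windows(pv, l=3)
```

Each element of `rows` is one training instance for the regressor: the first `l` hours produce no row because their window is incomplete.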
tsfresh will run on the time series data pv(t), ..., pv(t-l) to get more features: simple statistic values such as mean, variance, autocorrelation, count_above_mean, etc.; complex values such as T_x__fft_coefficient__coeff_0__attr_"abs" and so on.
And then we can get the data containing the following columns:
t, pv(t-1), pv(t-2), ... pv(t-l), derived_feature_1, derived_feature_2, ..., derived_feature_N
Now we can feed the features above and the label into XGBoost to train a model.
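To make the derived columns concrete, here is a minimal sketch that computes three of the simple statistics by hand with the standard library. It does not call tsfresh; `derive_features` and the sample window are hypothetical, and using the population variance is an assumption made for this illustration.

```python
import statistics
from typing import Dict, List


def derive_features(window: List[float]) -> Dict[str, float]:
    """Compute a few simple statistics over one time window."""
    m = statistics.fmean(window)
    return {
        "mean": m,
        # Population variance (divide by n), an assumption for this sketch.
        "variance": statistics.pvariance(window),
        # Number of values strictly greater than the window mean.
        "count_above_mean": float(sum(1 for x in window if x > m)),
    }


# Hypothetical window pv(t-1), pv(t-2), pv(t-3) at t = 3.
window = [11.0, 12.0, 10.0]
derived = derive_features(window)

# One training row: t, the lagged values, then the derived features.
row = [3.0] + window + [
    derived["mean"],
    derived["variance"],
    derived["count_above_mean"],
]
```

The resulting `row` has the same layout as the column list above: the time index, the lagged values, and then the derived features.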
Users don't have to know whether TSFRESH would generate a new table.
If we need to explain the model, the explanation result will contain features that the user doesn't even know about. If we let the COLUMN clause output a table storing the features automatically generated by tsfresh, the user still has to specify a table name.
That's why we designed the COLUMN clause: to describe the feature engineering.
After all, we'd like users not to need the COLUMN clause when writing a TO TRAIN statement; the columns are naturally derived from the SELECT statement.
I still believe the COLUMN clause should be the solution. For example, we may need to specify several different windows; with COLUMN we can write:
SELECT * FROM my_ts_table TO TRAIN xgboost.gbtree WITH objective='reg:squarederror'
COLUMN
TSFRESH(t, "v1|v2", 7),
TSFRESH(t, "v1|v2", 30),
TSFRESH(t, v1, 180)
INTO my_ts_xgb_model;
Without the COLUMN clause, the statements will be tedious and error-prone:
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",7) INTO t7;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",30) INTO t30;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, v1, 180) INTO t_v1_180;
SELECT * FROM
(SELECT * FROM
(SELECT * from my_ts_table JOIN t7 ON my_ts_table.t = t7.t) x
JOIN t30 ON x.t = t30.t) y
JOIN t_v1_180 ON y.t = t_v1_180.t
TO TRAIN xgboost.gbtree WITH objective='reg:squarederror' INTO my_ts_xgb_model;
It will be a nightmare for users to maintain these statements if they want to add or remove calls to TSFRESH later.
We can combine the following three expressions into one:
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",7) INTO t7;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2",30) INTO t30;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, v1, 180) INTO t_v1_180;
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2", [7, 30, 180]) INTO table_with_derived_feature;
For XGBoost explain, we need a data table containing both the original features from the source table and the derived features. So we need the specific table name table_with_derived_feature to execute the explain SQL, as follows:
SELECT * FROM table_with_derived_feature TO EXPLAIN my_model
So it would be more user-friendly to put TSFRESH in the TO RUN clause.
I prefer the TO RUN clause. COLUMN TSFRESH would generate some additional columns, which may be confusing to users. TO RUN executes a Python function call whose input is a table (SELECT ...) and whose output is a table.
SELECT * FROM my_ts_table TO RUN TSFRESH(t, "v1|v2", [7, 30, 180]) INTO table_with_derived_feature;
In addition, users can publish the Python function definition in Model zoo as a Docker image; the above TO RUN TSFRESH clause would then look like:
SELECT * FROM my_ts_table
TO RUN my-registry/yanxu/tsfresh:latest/run WITH
ts_col='t',
value_col='v1,v2',
windows=[7, 30, 180]
INTO table_with_derived_feature;
COLUMN VS TO RUN
The model definition in model zoo is not a complete model. If we want to make the model complete based on the schema of the source table, we choose the COLUMN clause. The code generated from the COLUMN clause runs together with each model training iteration.
The COLUMN clause is an attribute of TO TRAIN; it describes/decorates how we convert the source data (SELECT * FROM) into the model definition in model zoo for each data instance.
If we want to transform the data before the model training process instead of making the model definition complete, the transformation is completed on the entire source table and produces a result table before training executes. In that case we can use SQL + UDF or the TO RUN clause, as with TSFRESH.
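The difference in when the transformation runs can be sketched as follows. This is a simplified model of the two approaches, not SQLFlow's generated code; `column_style`, `to_run_style`, `add_square`, and the toy rows are all hypothetical names and data introduced for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def column_style(rows: List[Row], transform: Callable[[Row], Row],
                 train_step: Callable[[Row], None], epochs: int) -> int:
    """COLUMN-like: transform each instance inside every training iteration."""
    calls = 0
    for _ in range(epochs):
        for row in rows:
            train_step(transform(row))
            calls += 1
    return calls  # number of transform calls


def to_run_style(rows: List[Row], transform: Callable[[Row], Row],
                 train_step: Callable[[Row], None], epochs: int) -> int:
    """TO RUN-like: transform the whole table once, then train on the result."""
    result_table = [transform(row) for row in rows]  # materialized before training
    for _ in range(epochs):
        for row in result_table:
            train_step(row)
    return len(rows)  # number of transform calls


def add_square(row: Row) -> Row:
    """Toy transformation that adds one derived column."""
    out = dict(row)
    out["v_squared"] = row["v"] ** 2
    return out


rows = [{"v": 1.0}, {"v": 2.0}, {"v": 3.0}]
seen_column: List[Row] = []
seen_to_run: List[Row] = []
column_calls = column_style(rows, add_square, seen_column.append, epochs=2)
to_run_calls = to_run_style(rows, add_square, seen_to_run.append, epochs=2)
```

Both loops feed the same instances to the training step; what differs is that the first applies the transformation per instance in every iteration, while the second produces a result table up front that can also be named and reused later.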
In fact, my original example is to generate [7, 30] for v2 and [7, 30, 180] for v1; it seems the WITH clause cannot avoid JOIN for this?
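One way to picture the per-column case is a transformation that takes a mapping from column to window sizes and writes every derived column into a single result table, so no JOIN is needed afterwards. The sketch below is an assumption about how such a function could behave, not an existing SQLFlow or tsfresh interface; `derive_table` and `rolling_mean` are hypothetical, a trailing mean stands in for tsfresh's features, and small windows replace 7/30/180 to keep the data short.

```python
from typing import Dict, List, Optional


def rolling_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Mean over the trailing `window` values; None until the window is full."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def derive_table(table: Dict[str, List[float]],
                 windows: Dict[str, List[int]]) -> Dict[str, List[Optional[float]]]:
    """Return the source columns plus one derived column per (column, window)."""
    result: Dict[str, List[Optional[float]]] = {
        name: list(col) for name, col in table.items()
    }
    for col, sizes in windows.items():
        for w in sizes:
            result[f"{col}__mean_{w}"] = rolling_mean(table[col], w)
    return result


# Hypothetical source table with a time column and two value columns.
table = {
    "t": [0.0, 1.0, 2.0, 3.0],
    "v1": [1.0, 2.0, 3.0, 4.0],
    "v2": [4.0, 4.0, 6.0, 6.0],
}
# Different window sets per column: two for v1, one for v2.
wide = derive_table(table, {"v1": [2, 3], "v2": [2]})
```

Because all derived columns share the rows of the source table, they line up on `t` without any join; whether the WITH clause can express such a mapping is exactly the open question here.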
On JOIN in TO RUN, there's another point from @brightcoder01: TRANSFORM statements
SELECT * FROM table1 TO TRANSFORM py_func1; -- add some columns to table1
SELECT * FROM table1 TO TRANSFORM py_func2; -- add some columns to table2
...
TO RUN syntax, from @shendiaomo, approved by @Yancey1989 @brightcoder01:
- Rename TO RUN to TO TRANSFORM to give explicit semantics to the new syntax: data transforming.
- TO TRANSFORM as a way to express calling a SQLFlow UDF.
- TO TRANSFORM cannot be nested in other SELECT statements; this is different from UDFs in standard SQL.
COLUMN syntax, from @brightcoder01 @shendiaomo:
- COLUMN is something that has to be bundled to the saved model, like tf.SavedModel.
Add more comments about COLUMN semantics:
- The COLUMN clause will be executed per data instance during model training/prediction.
tsfresh automatically calculates a large number of time series characteristics and is widely used in time series modeling. We have two options to integrate it into SQLFlow.
Option 1
With this option, the COLUMN clause generates step-by-step Python function calls and then trains regression models:
Con:
ONE statement to deal with preprocessing and training; this corresponds with the COLUMN design.
Option 2
Con:
tsfresh generates some additional columns; if we want to explain the trained model, it's more meaningful for users.