intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Chronos TSDataset: various enhancement #227

Open TheaperDeng opened 3 years ago

TheaperDeng commented 3 years ago
cabuliwallah commented 3 years ago

Add a YEAR feature in gen_dt_feature.
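A minimal sketch of what such a feature could look like in plain pandas (the column names here are illustrative, not the actual gen_dt_feature output):

import pandas as pd

df = pd.DataFrame({"StartTime": pd.date_range("2019-01-01", periods=3, freq="D")})

# derive a YEAR feature from the datetime column, analogous to the existing
# MONTH feature mentioned in the next comment
df["YEAR"] = df["StartTime"].dt.year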

cabuliwallah commented 3 years ago

Remove the quoted source column from the result column names of gen_dt_feature, e.g. MONTH(StartTime) -> MONTH.
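A hedged sketch of the renaming this asks for, stripping the bracketed source column from the generated names (the sample columns below are illustrative only):

import re
import pandas as pd

# illustrative gen_dt_feature output with the current column names
out = pd.DataFrame({"MONTH(StartTime)": [1, 2], "DAY(StartTime)": [1, 15]})

# MONTH(StartTime) -> MONTH, DAY(StartTime) -> DAY
out.columns = [re.sub(r"\(.*\)$", "", c) for c in out.columns]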

liangs6212 commented 3 years ago

When a non-pandas-datetime column (non_pd_datetime) is present, impute("linear") fails with the following error: "Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype".

def get_multi_id_ts_df():
    # cast every column to object dtype (train_df is the multi-id frame shown below)
    return train_df.astype('object')

df = get_multi_id_ts_df()
tsdata = TSDataset.from_pandas(df, target_col="value", dt_col="datetime",
                               extra_feature_col=['extra feature'])
tsdata.impute("linear")  # raises the interpolation error above
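A possible workaround on the user side (an assumption, not the fix adopted in TSDataset) is to cast the value columns back to a numeric dtype before interpolating:

import numpy as np
import pandas as pd

# object-dtype frame similar to the one produced by get_multi_id_ts_df
obj_df = pd.DataFrame({"value": np.random.randn(5)}).astype('object')

# casting back to numeric lets linear interpolation work again
obj_df["value"] = pd.to_numeric(obj_df["value"])
obj_df = obj_df.interpolate(method="linear")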
liangs6212 commented 3 years ago
df = pd.DataFrame({"datetime": np.arange(100),
                   "id": np.array(['00'] * 100),
                   "value": np.random.randn(100),
                   "extra feature": np.random.randn(100)})

non_pd_datetime: the datetime column above is not a pandas datetime dtype.

not_aligned: the series for different ids are not aligned in length or frequency, built as follows.

def not_aligned():
    # three ids with different lengths (20/30/50 rows)
    df_val = pd.DataFrame({"id": np.array(['00'] * 20 + ['01'] * 30 + ['02'] * 50),
                           "value": np.random.randn(100),
                           "extra feature": np.random.randn(100)})
    # datetime ranges with different start dates and frequencies (S/H/D)
    data_sec = pd.DataFrame({"datetime": pd.date_range(start='1/1/2019 00:00:00', periods=20, freq='S')})
    data_min = pd.DataFrame({"datetime": pd.date_range(start='1/2/2019 00:00:00', periods=30, freq='H')})
    data_hou = pd.DataFrame({"datetime": pd.date_range(start='1/3/2019 00:00:00', periods=50, freq='D')})
    dt_val = pd.concat([data_sec, data_min, data_hou], axis=0, ignore_index=True)
    df = pd.merge(left=dt_val, right=df_val, left_index=True, right_index=True)
    return df
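A minimal sketch of the kind of early check that would turn the non_pd_datetime case into a clear error instead of the interpolation failure above (the helper name is illustrative, not the actual TSDataset internals):

import pandas as pd

def check_dt_col(df, dt_col):
    # fail fast when the datetime column is not a pandas datetime dtype
    if not pd.api.types.is_datetime64_any_dtype(df[dt_col]):
        raise ValueError("dt_col '%s' must be a pandas datetime column, got dtype %s"
                         % (dt_col, df[dt_col].dtype))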
cabuliwallah commented 3 years ago

When scale(scaler, fit=False) is called multiple times, it should behave the same as calling it only once, since the scaling is only applied once when fit=True.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from zoo.chronos.data import TSDataset  # Chronos TSDataset (import path may differ by version)

df = pd.DataFrame({"datetime": np.array(['1/1/2019', '1/2/2019']),
                   "value": np.array([1, 2])})
df_test = pd.DataFrame({"datetime": np.array(['1/3/2019', '1/4/2019']),
                        "value": np.array([1, 2])})
tsdata = TSDataset.from_pandas(df, dt_col="datetime", target_col="value")
tsdata_test = TSDataset.from_pandas(df_test, dt_col="datetime", target_col="value")

standard_scaler = StandardScaler()
tsdata.scale(standard_scaler, fit=True)
# scaling twice with fit=False currently transforms the data twice
tsdata_test.scale(standard_scaler, fit=False).scale(standard_scaler, fit=False)
print(tsdata_test.df)

The expected value column of the output is [-1, 1]; currently it is [-5, -1].
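The numbers check out: the scaler fitted on df has mean 1.5 and standard deviation 0.5, so one transform maps [1, 2] to [-1, 1], and applying it a second time maps [-1, 1] to [(-1 - 1.5)/0.5, (1 - 1.5)/0.5] = [-5, -1], which matches the observed output. A minimal sketch of the idempotent behaviour requested above (the guard flag is hypothetical, not part of TSDataset):

class ScaleOnce:
    # wraps a fitted sklearn scaler so repeated transform calls are no-ops
    def __init__(self, scaler):
        self.scaler = scaler
        self._already_scaled = False  # hypothetical guard flag

    def transform(self, values):
        if self._already_scaled:
            return values  # second and later calls change nothing
        self._already_scaled = True
        return self.scaler.transform(values)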

liangs6212 commented 3 years ago

Testing tsdata operations called in random order produces three types of errors (using get_multi_id_ts_df).

cabuliwallah commented 3 years ago

In utils/feature.py, function _is_weekend(): the line return (weekday >= 5).values should be changed to return (weekday >= 5).astype(int).values
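A small standalone illustration (not Chronos code) of why the cast matters: the boolean comparison result becomes a 0/1 numeric feature.

import pandas as pd

dt = pd.Series(pd.date_range("2019-01-04", periods=4, freq="D"))  # Fri, Sat, Sun, Mon
weekday = dt.dt.weekday

print((weekday >= 5).values)              # [False  True  True False] -> bool dtype
print((weekday >= 5).astype(int).values)  # [0 1 1 0] -> usable as a numeric feature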