intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Chronos TSDataset: various enhancement #227

Open TheaperDeng opened 3 years ago

TheaperDeng commented 3 years ago
cabuliwallah commented 3 years ago

Add a YEAR feature in gen_dt_feature.
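A minimal sketch of what such a feature could look like in plain pandas (the column names here are illustrative, not the actual gen_dt_feature output):

import pandas as pd

df = pd.DataFrame({"StartTime": pd.date_range("2019-01-01", periods=3, freq="D")})

# derive a YEAR feature from the datetime column, analogous to the existing
# MONTH feature mentioned in the next comment
df["YEAR"] = df["StartTime"].dt.year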

cabuliwallah commented 3 years ago

Remove the quoted source column from the result column names of gen_dt_feature, e.g. MONTH(StartTime) -> MONTH.
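A hedged sketch of the renaming this asks for, stripping the bracketed source column from the generated names (the sample columns below are illustrative only):

import re
import pandas as pd

# illustrative gen_dt_feature output with the current column names
out = pd.DataFrame({"MONTH(StartTime)": [1, 2], "DAY(StartTime)": [1, 15]})

# MONTH(StartTime) -> MONTH, DAY(StartTime) -> DAY
out.columns = [re.sub(r"\(.*\)$", "", c) for c in out.columns]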

liangs6212 commented 3 years ago

When a non-pandas-datetime column (non_pd_datetime) is present, impute("linear") fails with the following error: "Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype".

def get_multi_id_ts_df():
    # cast every column to object dtype (train_df is the multi-id frame shown below)
    return train_df.astype('object')

df = get_multi_id_ts_df()
tsdata = TSDataset.from_pandas(df, target_col="value", dt_col="datetime",
                               extra_feature_col=['extra feature'])
tsdata.impute("linear")  # raises the interpolation error above
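A possible workaround on the user side (an assumption, not the fix adopted in TSDataset) is to cast the value columns back to a numeric dtype before interpolating:

import numpy as np
import pandas as pd

# object-dtype frame similar to the one produced by get_multi_id_ts_df
obj_df = pd.DataFrame({"value": np.random.randn(5)}).astype('object')

# casting back to numeric lets linear interpolation work again
obj_df["value"] = pd.to_numeric(obj_df["value"])
obj_df = obj_df.interpolate(method="linear")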
liangs6212 commented 3 years ago
df = pd.DataFrame({"datetime": np.arange(100),
                   "id": np.array(['00'] * 100),
                   "value": np.random.randn(100),
                   "extra feature": np.random.randn(100)})

non_pd_datetime: the datetime column above is not a pandas datetime dtype.

not_aligned: the series for different ids are not aligned in length or frequency, built as follows.

def not_aligned():
    # three ids with different lengths (20/30/50 rows)
    df_val = pd.DataFrame({"id": np.array(['00'] * 20 + ['01'] * 30 + ['02'] * 50),
                           "value": np.random.randn(100),
                           "extra feature": np.random.randn(100)})
    # datetime ranges with different start dates and frequencies (S/H/D)
    data_sec = pd.DataFrame({"datetime": pd.date_range(start='1/1/2019 00:00:00', periods=20, freq='S')})
    data_min = pd.DataFrame({"datetime": pd.date_range(start='1/2/2019 00:00:00', periods=30, freq='H')})
    data_hou = pd.DataFrame({"datetime": pd.date_range(start='1/3/2019 00:00:00', periods=50, freq='D')})
    dt_val = pd.concat([data_sec, data_min, data_hou], axis=0, ignore_index=True)
    df = pd.merge(left=dt_val, right=df_val, left_index=True, right_index=True)
    return df
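A minimal sketch of the kind of early check that would turn the non_pd_datetime case into a clear error instead of the interpolation failure above (the helper name is illustrative, not the actual TSDataset internals):

import pandas as pd

def check_dt_col(df, dt_col):
    # fail fast when the datetime column is not a pandas datetime dtype
    if not pd.api.types.is_datetime64_any_dtype(df[dt_col]):
        raise ValueError("dt_col '%s' must be a pandas datetime column, got dtype %s"
                         % (dt_col, df[dt_col].dtype))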
cabuliwallah commented 3 years ago

When scale(scaler, fit=False) is called multiple times, it should behave the same as calling it only once, since the scaling is only applied once when fit=True.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from zoo.chronos.data import TSDataset  # Chronos TSDataset (import path may differ by version)

df = pd.DataFrame({"datetime": np.array(['1/1/2019', '1/2/2019']),
                   "value": np.array([1, 2])})
df_test = pd.DataFrame({"datetime": np.array(['1/3/2019', '1/4/2019']),
                        "value": np.array([1, 2])})
tsdata = TSDataset.from_pandas(df, dt_col="datetime", target_col="value")
tsdata_test = TSDataset.from_pandas(df_test, dt_col="datetime", target_col="value")

standard_scaler = StandardScaler()
tsdata.scale(standard_scaler, fit=True)
# scaling twice with fit=False currently transforms the data twice
tsdata_test.scale(standard_scaler, fit=False).scale(standard_scaler, fit=False)
print(tsdata_test.df)

The expected value column of the output is [-1, 1]; currently it is [-5, -1].
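The numbers check out: the scaler fitted on df has mean 1.5 and standard deviation 0.5, so one transform maps [1, 2] to [-1, 1], and applying it a second time maps [-1, 1] to [(-1 - 1.5)/0.5, (1 - 1.5)/0.5] = [-5, -1], which matches the observed output. A minimal sketch of the idempotent behaviour requested above (the guard flag is hypothetical, not part of TSDataset):

class ScaleOnce:
    # wraps a fitted sklearn scaler so repeated transform calls are no-ops
    def __init__(self, scaler):
        self.scaler = scaler
        self._already_scaled = False  # hypothetical guard flag

    def transform(self, values):
        if self._already_scaled:
            return values  # second and later calls change nothing
        self._already_scaled = True
        return self.scaler.transform(values)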

liangs6212 commented 3 years ago

Testing tsdata operations called in random order produces three types of errors (using get_multi_id_ts_df).

cabuliwallah commented 3 years ago

In utils/feature.py, function _is_weekend(): the line return (weekday >= 5).values should be changed to return (weekday >= 5).astype(int).values
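A small standalone illustration (not Chronos code) of why the cast matters: the boolean comparison result becomes a 0/1 numeric feature.

import pandas as pd

dt = pd.Series(pd.date_range("2019-01-04", periods=4, freq="D"))  # Fri, Sat, Sun, Mon
weekday = dt.dt.weekday

print((weekday >= 5).values)              # [False  True  True False] -> bool dtype
print((weekday >= 5).astype(int).values)  # [0 1 1 0] -> usable as a numeric feature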