timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Df2xy causing incorrect splits #666

Closed. Zwayeh closed this issue 1 year ago.

Zwayeh commented 1 year ago

I'm unsure whether this is an issue with my code or understanding, or a bug in the df2xy function. When generating splits for multivariate data, I find the data being split by feature instead of by sample, which of course leads to incorrect training and testing.

To test, I have imported the NATOPS UCR dataset and shown the splits generated below:

[Image: original splits]

I have then converted this dataset to a pandas df, and converted it back to X, y by utilising the df2xy function:

[Image: df2xy splits]

As you can see, the splits change from correctly being split by sample, to being split by feature. Any help would be greatly appreciated :)
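A quick way to detect this symptom is to check that the split indices cover the sample axis (axis 0) rather than the feature axis. The sketch below is only an illustration: the helper name `split_covers_samples` and the array shape are hypothetical and are not part of tsai.

```python
import numpy as np

def split_covers_samples(X, splits):
    """Return True if the splits partition the sample axis (axis 0) of X."""
    idx = np.sort(np.concatenate([np.asarray(s) for s in splits]))
    return np.array_equal(idx, np.arange(X.shape[0]))

# Hypothetical shape: 10 samples, 3 features, 5 time steps.
X = np.zeros((10, 3, 5))
good_splits = (list(range(8)), [8, 9])  # indexes the 10 samples
bad_splits = ([0, 1], [2])              # indexes the 3 features instead

print(split_covers_samples(X, good_splits))
print(split_covers_samples(X, bad_splits))
```

With splits built per sample the check passes; with splits whose indices only span the number of features it fails.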

oguiza commented 1 year ago

Hi @Zwayeh, I've investigated the issue you mention and found a bug in the code. It's related to a pandas sorting behavior I wasn't aware of. You can reproduce the pandas behavior with this code if you're interested:

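import numpy as np
import pandas as pd

# Five distinct values, each repeated four times, so every value has ties
df = pd.DataFrame(np.repeat(np.arange(5), 4), columns=["values"])
df["values2"] = df["values"]
df["rank"] = np.arange(len(df))  # records the original row order
df.sort_values(["values"])  # tied rows may come back reordered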

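When sorting by a single column, pandas defaults to kind='quicksort', which is not a stable sort, so rows with equal values may come back in a different relative order. To avoid that, you need to pass kind='stable':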

df.sort_values(['values'], kind='stable')
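The effect of kind='stable' can be checked directly on the frame above. This is a minimal sketch that only asserts the stable case: it makes no claim about the exact order the default sort returns, and the second-key variant is an alternative for comparison, not the fix applied in tsai.

```python
import numpy as np
import pandas as pd

# Same frame as above: five distinct values, four tied rows each.
df = pd.DataFrame(np.repeat(np.arange(5), 4), columns=["values"])
df["rank"] = np.arange(len(df))  # records the original row order

# A stable sort keeps tied rows in their original relative order,
# so "rank" is still increasing after sorting by "values".
stable = df.sort_values(["values"], kind="stable")
assert stable["rank"].is_monotonic_increasing

# Breaking ties with a second key also gives a deterministic order.
two_keys = df.sort_values(["values", "rank"])
assert two_keys["rank"].tolist() == stable["rank"].tolist()
```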

You can now confirm the code works by doing this:

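from tsai.all import *

X, y, _ = get_UCR_data('NATOPS', split_data=False)
splits = get_splits(y, valid_size=.2, stratify=.2, random_state=34, shuffle=True)
# Long format: one row per (sample, step), one column per feature
df = pd.DataFrame(X.swapaxes(0,1).reshape(X.shape[1], -1).T)
df['target'] = np.repeat(y, X.shape[-1])
df['sample_id'] = np.repeat(np.arange(len(X)), X.shape[-1])
# X_df, y_df are the arrays obtained by converting df back with df2xy (that call is not shown here)
test_eq(X, X_df)
test_eq(y, y_df)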
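For readers without tsai installed, the same round trip can be sketched with numpy and pandas alone. This is not tsai's df2xy implementation: the synthetic data, the `step` column and the reshape-based reconstruction are assumptions made for illustration. It shows why a stable sort on the sample column is enough to recover the original (samples, features, steps) array.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a multivariate dataset: (samples, features, steps).
rng = np.random.default_rng(0)
n_samples, n_vars, seq_len = 6, 3, 4
X = rng.normal(size=(n_samples, n_vars, seq_len))
y = np.array([0, 1, 0, 1, 0, 1])

# Long format: one row per (sample, step), one column per feature.
feature_cols = list(range(n_vars))
df = pd.DataFrame(X.swapaxes(0, 1).reshape(n_vars, -1).T)
df["target"] = np.repeat(y, seq_len)
df["sample_id"] = np.repeat(np.arange(n_samples), seq_len)
df["step"] = np.tile(np.arange(seq_len), n_samples)

# Interleave the samples: all rows for step 0 first, then step 1, and so on.
by_step = df.sort_values("step", kind="stable")

# A stable sort on sample_id regroups the rows by sample while keeping
# the steps of each sample in their existing (ascending) order.
by_sample = by_step.sort_values("sample_id", kind="stable")

# Rebuild the 3D array: (samples, steps, features) -> (samples, features, steps).
X_back = (by_sample[feature_cols].to_numpy()
          .reshape(n_samples, seq_len, n_vars)
          .swapaxes(1, 2))
y_back = by_sample.groupby("sample_id")["target"].first().to_numpy()

assert np.array_equal(X, X_back)
assert np.array_equal(y, y_back)
```

If the second sort were not stable, the steps within a sample could come back in a different order and the reshape would silently produce wrong sequences.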

I've created a gist to demonstrate how the fixed code works.

Please let me know if this resolves your issue (if so, please close it).

oguiza commented 1 year ago

Closed due to a lack of response.