Error creating multivariate dataloader

chandrashan commented 3 years ago

Hi, I was trying to follow the tutorial notebook on how to prepare data:

https://github.com/timeseriesAI/tsai/blob/master/tutorial_nbs/00c_Time_Series_data_preparation.ipynb

I opened this in Google Colab, and have tried with and without the stable flag. My versions in Colab from the top cell are: tsai 0.2.13, fastai 2.1.10, fastcore 1.3.13, torch 1.7.0+cu101.

Under the End-End examples/Single multivariate time series section, I can run the first cell fine and see the df. However, when I run the second cell to create the data loader, I get the following error:

    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-3-dbbb5a3104e2> in <module>()
          7 seq_first = True
          8 
    ----> 9 X, y = SlidingWindow(window_length, stride=stride, start=start, get_x=get_x,  get_y=get_y, horizon=horizon, seq_first=seq_first)(df)
         10 splits = get_splits(y, valid_size=.2, stratify=True, random_state=23, shuffle=False)
         11 tfms  = [None, [Categorize()]]

    /usr/local/lib/python3.6/dist-packages/tsai/data/preparation.py in SlidingWindow(window_len, stride, start, get_x, get_y, y_func, horizon, seq_first, sort_by, ascending, check_leakage)
         93     if min_horizon <= 0 and y_func is None and get_y != [] and check_leakage:
         94         assert get_x is not None and  get_y is not None and len([y for y in _get_y if y in _get_x]) == 0,  \
    ---> 95         'you need to change either horizon, get_x, get_y or use a y_func to avoid leakage'
         96     stride = ifnone(stride, window_len)
         97 

    AssertionError: you need to change either horizon, get_x, get_y or use a y_func to avoid leakage
    ---------------------------

Any suggestions?

chandrashan commented 3 years ago

If I set get_x manually to a list of the columns, it seems to work OK, i.e.

    get_x = list(range(0, 24))

My understanding was that get_x is supposed to get all the columns except the target column anyway?

geoHeil commented 3 years ago

> My understanding was that get_x is supposed to get all the columns except the target column anyway?

I can confirm that get_x does exactly this, though I am not sure whether it happens by default. As far as I understand it, you must select the desired columns manually here.

logic-language commented 3 years ago

Ah, gotcha. I'm just wondering whether it would be worth updating the tutorial notebook https://github.com/timeseriesAI/tsai/blob/master/tutorial_nbs/00c_Time_Series_data_preparation.ipynb to reflect that, since when I ran that code as is, it didn't run until I manually selected the columns.

I'll see if I can work out how to make a pull request!

oguiza commented 3 years ago

Hi, I'm not 100% sure what the issue is, as I'm missing some details. What horizon are you using? 0?

SlidingWindow will take all columns if get_x=None, and it will do the same if get_y=None. But to avoid leakage, it performs a check, so that get_x and get_y cannot both be set to None if the horizon is zero. So the easiest way to resolve this is to choose horizon > 0 (if it's a prediction of a future step), or to specify get_x and get_y so that they don't overlap.
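The leakage check can be sketched in plain Python. This is a simplified illustration modelled on the assertion visible in the traceback earlier in this thread, not tsai's actual implementation: the function name `leakage_check_passes` and the column names are hypothetical, and the `y_func` and `check_leakage` switches are left out.

```python
def leakage_check_passes(get_x=None, get_y=None, horizon=0):
    """Return True when X and y cannot share data from the same time step.

    Hypothetical, simplified restatement of the assertion in the traceback.
    None means "use every column", so a None selector always overlaps.
    """
    if horizon > 0:
        # y comes from future steps, so shared columns are not a leak.
        return True
    if get_x is None or get_y is None:
        # With horizon <= 0, a None selector fails the check.
        return False
    # Explicit selections pass only if they share no column.
    return not set(get_x) & set(get_y)


print(leakage_check_passes())                                # False: both None, horizon 0
print(leakage_check_passes(horizon=1))                       # True: target is in the future
print(leakage_check_passes(["feat_a", "feat_b"], ["target"]))  # True: disjoint columns
```

This matches the two ways out described above: a positive horizon, or explicit, non-overlapping `get_x` and `get_y`.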

geoHeil commented 3 years ago

@oguiza this is interesting. So for a regression problem where I want to predict 1 ... n steps ahead for a multivariate dataset, I should set horizon = n and get_y = None, and everything else should work out of the box? So it would perform multivariate forecasting?

oguiza commented 3 years ago

In terms of data preparation you are correct. That’d be the way to prepare a multivariate, multi-step dataset. In terms of creating the model, it would not work out of the box. You’d need to pass a custom head to the model you choose. You may have noticed I have a bunch of models that end with Plus. Those allow you to create variations of the original models, and can take for example a custom head that creates a 2d output. I have not yet tested this approach, but I think it’d be easy to make it work. If you are interested and have a multivariate forecasting dataset, you can prepare the dataset as you described, and I can create a custom 2d output head, so you can test it.
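To make the shape of such a multivariate, multi-step dataset concrete, here is a minimal pure-Python sketch. It assumes, as discussed above, that horizon = n means "the next n steps after each window". The helper `sliding_windows` is hypothetical and is not tsai's `SlidingWindow`; the array layout tsai produces may differ.

```python
def sliding_windows(rows, window_len, horizon, stride=1):
    """Split a multivariate series into (X, y) pairs.

    rows: list of time steps, each a list of feature values.
    X[i] holds window_len consecutive steps; y[i] holds the next
    `horizon` steps of every feature (steps 1..horizon ahead).
    """
    X, y = [], []
    last_start = len(rows) - window_len - horizon
    for start in range(0, last_start + 1, stride):
        end = start + window_len
        X.append(rows[start:end])
        y.append(rows[end:end + horizon])
    return X, y


# Toy series: 6 time steps, 2 features (values are made up).
series = [[t, 10 * t] for t in range(6)]
X, y = sliding_windows(series, window_len=3, horizon=2)

print(len(X))  # 2
print(X[0])    # [[0, 0], [1, 10], [2, 20]]
print(y[0])    # [[3, 30], [4, 40]]
```

Each target here is itself two-dimensional (steps by features), which is why a standard head with a flat output is not enough and a 2d output head is needed.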

geoHeil commented 3 years ago

Actually, I want to perform anomaly detection. My problem is that the labels are rather weak. So I was thinking of using either an autoencoder-based approach or maybe regression to utilize the reconstruction error. I can certainly prepare the data in such a way, but so far I do not fully understand how to create such a custom head. If you could help with preparing it, that would be awesome.

When trying to prepare the data:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <timed exec> in <module>

    ~/development/conda_envs/my_conda_env/lib/python3.8/site-packages/tsai/data/preparation.py in _SlidingWindowPanel(df)
        173         _key = []
        174         for i, v in enumerate(progress_bar(unique_id_values)):
    --> 175             x_v, y_v = SlidingWindow(window_len, stride=stride, start=start, get_x=get_x, get_y=get_y, y_func=y_func,
        176                                      horizon=horizon, seq_first=seq_first, check_leakage=check_leakage)(df[(df[unique_id_cols].values == v).sum(axis=1) == len(v)])
        177             if x_v is not None and len(x_v) > 0:

    ~/development/conda_envs/my_conda_env/lib/python3.8/site-packages/tsai/data/preparation.py in _inner(o)
        122             y = y[y_sub_windows]
        123             if y_func is not None and len(y) > 0:
    --> 124                 y = y_func(y)
        125             if y.ndim >= 2:
        126                 for d in np.arange(1, y.ndim)[::-1]:

    <timed exec> in y_func(o)

    ~/development/conda_envs/my_conda_env/lib/python3.8/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
         45 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
         46          initial=_NoValue, where=True):
    ---> 47     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
         48 
         49 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

    TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

If the date column is included, it causes problems. I guess this might be a bug. I do not want to regress on the date, but to predict n (hours) ahead. However, the date is needed for the sort argument of the SlidingWindow operator.

oguiza commented 3 years ago

I'm not sure what's causing your error, but it seems to be caused by the y_func that you are using. Make sure you test it separately before using it with SlidingWindow. SlidingWindow uses the columns that you pass as sort_by just for sorting purposes. If you don't want to use them to create X, you have to indicate which columns you want to use rather than setting get_x or get_y to None (in which case they will be selected). I'm working on a custom head that generates 3d output (batch size x feats x seq_len) for a separate project. I'll let you know when it's available.
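Testing a y_func in isolation might look like the sketch below. The `y_func` shown is a hypothetical summing function (the actual one is not shown in this thread), and stdlib `datetime` objects stand in for the pandas `Timestamp` values in the traceback. It illustrates that a sum-style reduction works on numeric columns but raises a `TypeError` as soon as a timestamp column is among the selected columns.

```python
from datetime import datetime


def y_func(values):
    """Hypothetical target function: sum the selected values."""
    return sum(values)


# Hypothetical contents of one window: a numeric column and a timestamp column.
numeric_col = [1.5, 2.0, 2.5]
timestamp_col = [datetime(2021, 1, 1, hour) for hour in range(3)]

print(y_func(numeric_col))  # 6.0

try:
    y_func(timestamp_col)
except TypeError as err:
    # Timestamps cannot be added together, so the reduction fails.
    print("y_func failed on timestamps:", err)
```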

geoHeil commented 3 years ago

Awesome.

But x should not include the hour column, right? You will sort the DF first and then extract only the X columns? Regarding y_func: this is set to None as you explained that for this regression use-case it should work this way.

oguiza commented 3 years ago

> But x should not include the hour column, right? You will sort the DF first and then extract only the X columns?

Correct.

geoHeil commented 3 years ago

Can I pre-prepare the data set and drop the hour column? Somehow, when setting get_y = None, I have so far been unable to work around the error.

oguiza commented 3 years ago

> Can I pre-prepare the data set and drop the hour column?

Sure. No need to sort the df within the sliding window functions.

If you include any sort_by columns that shouldn't be part of the extracted data, the only way is to set get_x and get_y to the appropriate values. None would otherwise select all columns, including the ones you don't need.

geoHeil commented 3 years ago

get_x is set to the right values. However, get_y is set to None, as you mentioned that this is needed for auto-regression (= regression for all the multivariate time series). This includes the timestamp column, but obviously it should rather be treated as an index. How could I exclude the timestamp column?

The only way I think this might be possible with the current implementation is to pre-sort the data before applying the SlidingWindowPanel operator. But I am not sure how to disable sorting for this operator in that case.

oguiza commented 3 years ago

I'm sorry, but I don't know which columns you have available. You should select as get_y all the columns that need to be predicted or that are arguments of the y_func (if any) that will generate the values to be predicted. Any column indicated in the sort_by will only be used to sort the df. All columns will be included if you set get_y to None.

geoHeil commented 3 years ago

Ah, so probably a misunderstanding on my side:

I interpreted https://github.com/timeseriesAI/tsai/issues/36#issuecomment-752062808 to mean that I need to set get_y to None to get the auto-regressive behavior for all the columns. Instead, I should simply set get_y to all the columns except time (thus excluding time), and then it should work.
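A minimal sketch of that column selection, assuming made-up column names (`timestamp`, `sensor_a`, ...) that should be replaced with the columns of the real DataFrame:

```python
# Hypothetical column names; replace them with the columns of your own DataFrame.
columns = ["timestamp", "sensor_a", "sensor_b", "sensor_c"]
sort_col = "timestamp"

# Predict every series except the column that is only used for ordering.
value_cols = [c for c in columns if c != sort_col]
get_x = value_cols
get_y = value_cols

print(get_y)  # ['sensor_a', 'sensor_b', 'sensor_c']
```

These lists would then be passed as `get_x` and `get_y`, with `sort_by=sort_col` and a positive `horizon`. Judging by the condition in the first traceback in this thread, the overlap between `get_x` and `get_y` is only checked when the horizon is zero or negative.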

oguiza commented 3 years ago

I will close this issue now but please feel free to reopen if needed.

timeseriesAI / tsai

Error creating multivariate dataloader #36