Closed: chandrashan closed this issue 3 years ago
It seems that if I manually set get_x to a list of the columns, it works OK, i.e.:
get_x = list(range(0,24))
My understanding was that get_x is supposed to get all the columns except the target column anyway?
I can confirm that get_x does exactly this, though I am not sure whether it happens automatically. As far as I understand it, you must select the desired columns manually here.
Ah, gotcha. I'm just wondering whether it would be worth updating the tutorial notebook https://github.com/timeseriesAI/tsai/blob/master/tutorial_nbs/00c_Time_Series_data_preparation.ipynb to reflect that, since when I ran that code as is, it didn't run until I manually selected the columns.
I'll see if I can work out how to make a pull request!
Hi,
I'm not 100% sure what the issue is, as I'm missing some details. What horizon are you using? 0?
SlidingWindow will take all columns if get_x=None, and it will do the same if get_y=None. To avoid leakage, however, it performs a check: get_x and get_y cannot both be set to None when the horizon is zero.
So the easiest way to resolve this is to choose horizon > 0 (if it's a prediction of a future step), or specify get_x and get_y so that they don't overlap.
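To make the rule above concrete, here is a minimal NumPy sketch of the column selection and the leakage check. It is not tsai's SlidingWindow implementation: the function name `sliding_window_sketch`, the output layout (samples x steps x columns), and the treatment of horizon=n as "the next n steps" are assumptions made only for illustration.

```python
import numpy as np

def sliding_window_sketch(data, window_len, horizon=1, get_x=None, get_y=None):
    """Illustrative only; NOT tsai's SlidingWindow. Names and layout are assumptions."""
    data = np.asarray(data)
    # Leakage check: with horizon == 0 the target is the last step of the window,
    # so taking all columns for both X and y would put the target inside X.
    if horizon == 0 and get_x is None and get_y is None:
        raise ValueError("get_x and get_y cannot both be None when horizon == 0")
    n_cols = data.shape[1]
    x_cols = list(range(n_cols)) if get_x is None else list(get_x)  # None -> all columns
    y_cols = list(range(n_cols)) if get_y is None else list(get_y)  # None -> all columns
    X, y = [], []
    for start in range(len(data) - window_len - horizon + 1):
        end = start + window_len
        X.append(data[start:end, x_cols])
        if horizon == 0:
            y.append(data[end - 1, y_cols])          # last step of the window
        else:
            y.append(data[end:end + horizon, y_cols])  # the next `horizon` steps
    return np.stack(X), np.stack(y)

data = np.arange(30).reshape(10, 3)  # 10 time steps, 3 columns
X, y = sliding_window_sketch(data, window_len=4, horizon=2)
X0, y0 = sliding_window_sketch(data, window_len=4, horizon=0, get_x=[0, 1], get_y=[2])
```

With horizon=2 both get_x and get_y can stay None, because the targets lie strictly after each window. With horizon=0 the columns have to be split so that X and y do not overlap.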
@oguiza this is interesting: so for a regression problem where I want to predict 1 ... n steps ahead for a multivariate dataset, I should set horizon = n and get_y = None, and everything else should work out of the box? So it would perform multivariate forecasting?
In terms of data preparation you are correct. That’d be the way to prepare a multivariate, multi-step dataset. In terms of creating the model, it would not work out of the box. You’d need to pass a custom head to the model you choose. You may have noticed I have a bunch of models that end with Plus. Those allow you to create variations of the original models, and can take, for example, a custom head that creates a 2d output. I have not yet tested this approach, but I think it’d be easy to make it work. If you are interested and have a multivariate forecasting dataset, you can prepare the dataset as you described, and I can create a custom 2d output head so you can test it.
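As a rough idea of what such a head could look like, here is a plain PyTorch sketch. It is not tsai's API and not the head mentioned above: the class name `MultiStepHead`, its parameters, and the assumed backbone output shape (batch, features, seq_len) are all hypothetical, and how a head is actually passed to a Plus model is not shown.

```python
import torch
from torch import nn

class MultiStepHead(nn.Module):
    """Hypothetical head: maps backbone features to a (n_vars x horizon) output per sample."""

    def __init__(self, in_features, seq_len, n_vars, horizon):
        super().__init__()
        self.n_vars = n_vars
        self.horizon = horizon
        self.flatten = nn.Flatten()  # (batch, in_features, seq_len) -> (batch, in_features * seq_len)
        self.linear = nn.Linear(in_features * seq_len, n_vars * horizon)

    def forward(self, x):
        # x is assumed to have shape (batch, in_features, seq_len)
        out = self.linear(self.flatten(x))
        # Reshape the flat prediction into one row per variable, one column per future step
        return out.view(-1, self.n_vars, self.horizon)

head = MultiStepHead(in_features=8, seq_len=10, n_vars=3, horizon=4)
out = head(torch.randn(2, 8, 10))  # batch of 2 samples
```

The targets produced during data preparation would need the same (n_vars x horizon) layout per sample for a regression loss to apply directly.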
Actually, I want to perform anomaly detection. My problem is that the labels are rather weak, so I was thinking of using either an autoencoder-based approach or maybe regression to utilize the reconstruction error. I can certainly prepare the data in such a way, but so far I do not fully understand how to create such a custom head. If you could help with preparing it, that would be awesome.
When trying to prepare the data:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<timed exec> in <module>
~/development/conda_envs/my_conda_env/lib/python3.8/site-packages/tsai/data/preparation.py in _SlidingWindowPanel(df)
173 _key = []
174 for i, v in enumerate(progress_bar(unique_id_values)):
--> 175 x_v, y_v = SlidingWindow(window_len, stride=stride, start=start, get_x=get_x, get_y=get_y, y_func=y_func,
176 horizon=horizon, seq_first=seq_first, check_leakage=check_leakage)(df[(df[unique_id_cols].values == v).sum(axis=1) == len(v)])
177 if x_v is not None and len(x_v) > 0:
~/development/conda_envs/my_conda_env/lib/python3.8/site-packages/tsai/data/preparation.py in _inner(o)
122 y = y[y_sub_windows]
123 if y_func is not None and len(y) > 0:
--> 124 y = y_func(y)
125 if y.ndim >= 2:
126 for d in np.arange(1, y.ndim)[::-1]:
<timed exec> in y_func(o)
~/development/conda_envs/my_conda_env/lib/python3.8/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
45 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
46 initial=_NoValue, where=True):
---> 47 return umr_sum(a, axis, dtype, out, keepdims, initial, where)
48
49 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
When the date column is included, it causes problems. I guess this might be a bug. I do not want to regress on the date, but to predict n (hours) ahead. However, the date is needed in the sort argument of the SlidingWindow operator.
I'm not sure what's causing your error, but it seems to come from the y_func that you are using. Make sure you test it separately before using it with SlidingWindow. SlidingWindow uses the columns that you pass as sort_by just for sorting purposes. If you don't want them to be used to create X, you have to indicate which columns you want to use, rather than setting get_x or get_y to None (in which case they will be selected). I'm working on a custom head that generates 3d output (batch size x feats x seq_len) for a separate project. I'll let you know when it's available.
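One way to test a y_func separately, as suggested above, is to call it on a small array with and without the date column. This is only a sketch: the `y_func` below is a hypothetical stand-in that sums over time steps (the actual function from the traceback is not shown in this thread), and the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 01:00",
        "2021-01-01 02:00", "2021-01-01 03:00",
    ]),
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

def y_func(o):
    # Hypothetical target function: sum each column over the time steps
    return o.sum(axis=0)

# Mixed datetime/float columns become an object array holding Timestamp values
with_date = df.to_numpy()
try:
    y_func(with_date)
    raised = False
except TypeError:
    # Timestamp + Timestamp is not defined, which matches the error in the traceback
    raised = True

# Dropping the date column leaves a plain float array that sums without error
numeric_only = df.drop(columns="date").to_numpy()
totals = y_func(numeric_only)
```

If the call only fails when the date column is present, the fix is on the column selection side (get_y) rather than in the y_func itself.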
Awesome.
But x should not include the hour column, right? You will sort the DF first and then extract only the X columns? Regarding y_func: this is set to None as you explained that for this regression use-case it should work this way.
But x should not include the hour column, right? You will sort the DF first and then extract only the X columns? Correct.
Can I pre-prepare the data set and drop the hour column? When setting get_y = None, I have so far been unable to work around the error.
Can I pre-prepare the data set and drop the hour column?
Sure. No need to sort the df within the sliding window functions.
If you include any sort_by columns that shouldn't be part of the extracted data, the only way is to set get_x and get_y to the appropriate values. None would otherwise select all columns, including the ones you don't need.
get_x is set to the right values. However, get_y is set to None, as you mentioned that this is needed for auto regression (= regression for all the multivariate time series). This includes the timestamp column, but obviously it should rather be treated as an index. How could I exclude the timestamp column?
The only way I think this might be possible with the current implementation is to pre-sort the data before applying the SlidingWindowPanel operator. But I am confused about how to then disable sorting for this operator.
I'm sorry, but I don't know which columns you have available. You should select as get_y all the columns that need to be predicted or that are arguments of the y_func (if any) that will generate the values to be predicted. Any column indicated in the sort_by will only be used to sort the df. All columns will be included if you set get_y to None.
Ah, so this was probably a misunderstanding on my side: I interpreted https://github.com/timeseriesAI/tsai/issues/36#issuecomment-752062808 as saying that I need to set get_y to None to get the auto-regressive behavior for all the columns. Instead, I should simply set get_y to all the columns except time (and thus exclude time), and then it should work.
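The conclusion above can be sketched with pandas as follows. The DataFrame and its column names (`timestamp`, `sensor_a`, `sensor_b`) are made up for illustration; the resulting index lists are the kind of value passed as get_x earlier in this thread (get_x = list(range(0,24))), and using the same list for get_y assumes horizon > 0 so the leakage check is satisfied.

```python
import pandas as pd

# Made-up example data, deliberately out of order
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 02:00", "2021-01-01 00:00", "2021-01-01 01:00",
    ]),
    "sensor_a": [3.0, 1.0, 2.0],
    "sensor_b": [30.0, 10.0, 20.0],
})

# Pre-sort by time so the timestamp is only used for ordering
df = df.sort_values("timestamp").reset_index(drop=True)

# Integer positions of every column except the timestamp
value_cols = [i for i, c in enumerate(df.columns) if c != "timestamp"]

# Same columns as inputs and as targets (auto-regressive setup, horizon > 0 assumed)
get_x = value_cols
get_y = value_cols
```

This keeps the timestamp out of both X and y, so no function ever has to operate on Timestamp values.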
I will close this issue now but please feel free to reopen if needed.
Hi, I was trying to follow the tutorial notebook on how to prepare data:
https://github.com/timeseriesAI/tsai/blob/master/tutorial_nbs/00c_Time_Series_data_preparation.ipynb
I opened this in Google Colab, and have tried with and without the stable flag. My versions in Colab from the top cell are: tsai 0.2.13, fastai 2.1.10, fastcore 1.3.13, torch 1.7.0+cu101.
Under the End-End examples/Single multivariate time series, I can load the first cell fine and see the df. However when I run the second cell to create the data loader, I get the following error:
Any suggestions?