Hey @gsamaras
Try doing this:
X_df['ds'] = X_df['time'].values
Y_df['ds'] = Y_df['time'].values
del X_df['time'], Y_df['time']
Also, from your example it is a bit difficult to know what format your time is in. You might want to take a look at the pd.to_datetime converter or use the pd.date_range() function. Take a look at this StackOverflow post.
By the way, I recommend using the NeuralForecast library, as we are migrating our attention to that repository.
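A minimal sketch of that conversion, assuming the raw timestamp column is named 'time' (as in this thread) and parses with pandas' default format inference:
import pandas as pd
# Assumption: the raw timestamp column is called 'time'; adjust to your data.
X_df['ds'] = pd.to_datetime(X_df['time'])
Y_df['ds'] = pd.to_datetime(Y_df['time'])
X_df = X_df.drop(columns=['time'])
Y_df = Y_df.drop(columns=['time'])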
@kdgutier my problem is not how to code it, but the logic. I had figured out the time column, but what about the others? I mean, I want to predict the usage column of my data, so somehow I should inject it into X_df, right?
Yes, that was my next question since I saw it in another issue, thanks. But since I opened the issue here we can continue the discussion here if you like; otherwise I can migrate it.
Here is a Google Colab N-HiTS example.
You should rename your 'usage' column to 'y'.
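For instance, a minimal rename sketch along these lines (the 'usage' and 'time' column names are assumptions taken from this thread; adjust to your data):
df = df.rename(columns={'usage': 'y', 'time': 'ds'})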
Is it really as simple as:
X_df['ds'] = X_df['time'].values
y_df['ds'] = y_df['time'].values
X_df['x'] = X_df['bw'].values
y_df['y'] = y_df['bw'].values
X_df['unique_id']='dummy'
y_df['unique_id']='dummy'
# same for test data
@kdgutier?
Training completes, but when I try to predict:
y_hat_df = model.predict(X_test_df)
I get:
ValueError: You are trying to merge on float64 and datetime64[ns] columns. If you wish to proceed you should use pd.concat
which I think happens because of some incompatibility between time and ds, which is another issue I guess.
I recommend converting your 'time' column to a datetime using pd.to_datetime; a lot of methods in the NeuralForecast library rely on you sending a datetime-formatted 'ds'.
The line X_df['x'] = X_df['bw'].values will cause you leakage.
The dataset already considers autoregressive features by default if you restrict yourself to sending only Y_df.
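Concretely, a minimal sketch of a leak-free setup, assuming the raw columns are 'time' and 'bw' as in this thread (adjust names to your data):
import pandas as pd
# Keep the target only in Y_df; no copy of it goes into exogenous features.
Y_df = pd.DataFrame({
    'unique_id': 'dummy',
    'ds': pd.to_datetime(df['time']),
    'y': df['bw'].values,
})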
@kdgutier apologies for the late response because of the weekend. OK, I think we are very close, but now predict crashes with the batch size (=32). Here is the situation:
from sklearn.model_selection import train_test_split

y = df.pop('y')   # target column
X = df            # remaining columns become the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
y_train = y_train.to_frame()
X_train['unique_id'] = 'dummy'
y_train['unique_id'] = 'dummy'
y_train['ds'] = X_train['ds']
print(X_train)
print(y_train)
# same for test data
...
which outputs:
x ds unique_id
0 fe 2020-05-13 08:45:57.228000000 dummy
1 fe 2020-05-13 08:46:58.343000064 dummy
2 fe 2020-05-13 08:47:59.299000064 dummy
3 fe 2020-05-13 08:49:00.236000000 dummy
4 fe 2020-05-13 08:50:01.188999936 dummy
... .. ... ...
6887 fe 2020-05-18 06:05:54.928000000 dummy
6888 fe 2020-05-18 06:06:55.985999872 dummy
6889 fe 2020-05-18 06:07:57.731000064 dummy
6890 fe 2020-05-18 06:08:58.804999936 dummy
6891 fe 2020-05-18 06:09:59.864000000 dummy
[6892 rows x 3 columns]
y unique_id ds
0 1575.6520 dummy 2020-05-13 08:45:57.228000000
1 1575.6520 dummy 2020-05-13 08:46:58.343000064
2 1527.7666 dummy 2020-05-13 08:47:59.299000064
3 1527.7666 dummy 2020-05-13 08:49:00.236000000
4 1477.7880 dummy 2020-05-13 08:50:01.188999936
... ... ... ...
6887 1675.4131 dummy 2020-05-18 06:05:54.928000000
6888 1641.9484 dummy 2020-05-18 06:06:55.985999872
6889 1646.2307 dummy 2020-05-18 06:07:57.731000064
6890 1646.2307 dummy 2020-05-18 06:08:58.804999936
6891 1650.9961 dummy 2020-05-18 06:09:59.864000000
[6892 rows x 3 columns]
x ds unique_id
6892 fe 2020-05-18 06:11:00.937999872 dummy
6893 fe 2020-05-18 06:12:02.014000128 dummy
6894 fe 2020-05-18 06:13:03.060000000 dummy
6895 fe 2020-05-18 06:14:04.118000128 dummy
6896 fe 2020-05-18 06:15:05.411000064 dummy
... .. ... ...
8610 fe 2020-05-19 11:28:28.334000128 dummy
8611 fe 2020-05-19 11:29:29.504000000 dummy
8612 fe 2020-05-19 11:30:30.544000000 dummy
8613 fe 2020-05-19 11:31:31.724000000 dummy
8614 fe 2020-05-19 11:32:32.780000000 dummy
[1723 rows x 3 columns]
y unique_id ds
6892 1652.7509 dummy 2020-05-18 06:11:00.937999872
6893 1616.0997 dummy 2020-05-18 06:12:02.014000128
6894 1616.0997 dummy 2020-05-18 06:13:03.060000000
6895 1725.8965 dummy 2020-05-18 06:14:04.118000128
6896 1790.9973 dummy 2020-05-18 06:15:05.411000064
... ... ... ...
8610 1007.4689 dummy 2020-05-19 11:28:28.334000128
8611 1020.8758 dummy 2020-05-19 11:29:29.504000000
8612 1020.8758 dummy 2020-05-19 11:30:30.544000000
8613 1059.2924 dummy 2020-05-19 11:31:31.724000000
8614 1025.6858 dummy 2020-05-19 11:32:32.780000000
[1723 rows x 3 columns]
Then I train exactly as explained in the Medium post:
model = ESRNN(max_epochs=3, freq_of_test=1, batch_size=32, ...)
model.fit(X_train, y_train)
and predict like this:
y_hat = model.predict(X_test)
which crashes:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-58-01b48d2fa9d2> in <module>()
5 print(y_test.shape)
6
----> 7 y_hat = model.predict(X_test)
8
9 # Evaluate predictions
1 frames
/usr/local/lib/python3.7/dist-packages/ESRNN/utils/data.py in update_batch_size(self, new_batch_size)
84 def update_batch_size(self, new_batch_size):
85 self.batch_size = new_batch_size
---> 86 assert self.batch_size <= self.n_series
87 self.n_batches = int(np.ceil(self.n_series / self.batch_size))
88
AssertionError:
What am I missing? Test size (1723) is greater than batch size (32).
The dataloader takes into account the number of 'unique_id' values; you have all of them set to 'dummy' in your example.
Hmm, to be honest I don't really understand how the unique_id applies in my case, where I want to do univariate time series forecasting on y.
How should I solve this problem, @kdgutier, please? It's not clear whether I would have to use as many unique_ids as the batch size, and if so, whether they should be balanced across the dataset. I am really lost here.
The maximum batch_size that the dataloader can use here is 1, because the ESRNN is a recurrent network: it needs to sequentially visit all the observations of your series.
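A minimal sketch of that constraint applied to the snippet above (the import is an assumption following the Medium post; other constructor arguments are omitted as in the earlier snippet):
from ESRNN import ESRNN
# A single 'dummy' series means n_series == 1, so batch_size must be 1.
model = ESRNN(max_epochs=3, freq_of_test=1, batch_size=1)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)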
Does that mean I am constrained to using a batch size of 1?
If you are using a window-based model like N-BEATS, N-HiTS or any other MLP-based model, you can have a bigger batch_size. If you are using a pure RNN model, by construction the number of series you can use is one, unless you do some special work on the dataloader/series preprocessing.
I suggest you move to the NeuralForecast library and try running the N-HiTS example.
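For reference, a minimal sketch of that route, assuming the current NeuralForecast API; the raw_df name, the hyperparameter values, and the 'min' frequency (based on the roughly one-minute sampling shown above) are assumptions, and the linked Colab example is the authoritative reference:
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

# NeuralForecast expects a single long-format frame with unique_id, ds, y.
df = pd.DataFrame({
    'unique_id': 'dummy',
    'ds': pd.to_datetime(raw_df['time']),
    'y': raw_df['usage'].values,
})

# Illustrative horizon/input_size values, not recommendations.
nf = NeuralForecast(models=[NHITS(h=60, input_size=120, max_steps=100)], freq='min')
nf.fit(df=df)
forecasts = nf.predict()  # DataFrame with an 'NHITS' prediction column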
I have read and executed the Medium example. I understand from the README that X_df must contain the columns ['unique_id', 'ds', 'x'] and y_df must contain the columns ['unique_id', 'ds', 'y'].
Now in my case I have my own data with, among others, a time column and a usage column, and I want to predict usage. How should I map my data to the required input format? I can only think of time getting mapped to ds.
I have read many issues (such as this one), but I wasn't able to figure out how to map my data.