kdgutier / esrnn_torch


How to prepare the data format? #42

Closed gsamaras closed 2 years ago

gsamaras commented 2 years ago

I have read the Medium article and run the example. I understand from the README that X_df must contain the columns ['unique_id', 'ds', 'x'] and y_df must contain the columns ['unique_id', 'ds', 'y'].

Now in my case I have:

     time              usage        requests  hits 
5    1.589360e+12  1575.6520        5074.0  1588.0
12   1.589360e+12  1575.6520        5074.0  1588.0
19   1.589360e+12  1527.7666        5042.0  1580.0
26   1.589360e+12  1527.7666        5042.0  1580.0
33   1.589360e+12  1477.7880        5297.0  1584.0
...

and I want to predict usage. How should I map my data to the required input format? I can only think of time getting mapped to ds.

I have read many issues (such as this one), but I wasn't able to figure out how to map my data.

kdgutier commented 2 years ago

Hey @gsamaras

Try doing this:

X_df['ds'] = X_df['time'].values
Y_df['ds'] = Y_df['time'].values
del X_df['time'], Y_df['time']

Also, from your example it is a bit difficult to tell what format your time column is in. You might want to take a look at the pd.to_datetime converter or the pd.date_range() function. Take a look at this StackOverflow post.
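For example, the time values in your sample (around 1.589e+12) look like Unix epoch milliseconds; if that is the case, a conversion along these lines should work (a minimal sketch, so double-check the unit='ms' assumption against your data):

import pandas as pd

# assuming 'time' holds Unix epoch timestamps in milliseconds
# (1.589360e+12 ms corresponds to mid-May 2020)
X_df['ds'] = pd.to_datetime(X_df['time'], unit='ms')
Y_df['ds'] = pd.to_datetime(Y_df['time'], unit='ms')
X_df = X_df.drop(columns=['time'])
Y_df = Y_df.drop(columns=['time'])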

By the way, I recommend using the NeuralForecast library, as we are migrating our attention to that repository.

gsamaras commented 2 years ago

@kdgutier my problem is not how to code it, but the logic. I had figured out the time column, but what about the others? I want to predict the usage column of my data, so somehow I should inject it into X_df, right?

Yes, that was my next question since I saw it in another issue, thanks. But since I opened the issue here we can continue the discussion here if you like; otherwise I can migrate it.

kdgutier commented 2 years ago

Here is a Google Colab N-HiTS example.

You should rename your 'usage' column to 'y'.
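Something along these lines (a small sketch using the column names from your sample; 'series_1' is just an arbitrary label):

df = df.rename(columns={'time': 'ds', 'usage': 'y'})
df['unique_id'] = 'series_1'   # one id for your single series
Y_df = df[['unique_id', 'ds', 'y']]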

gsamaras commented 2 years ago

Is it really as simple as:

X_df['ds'] = X_df['time'].values
y_df['ds'] = y_df['time'].values
X_df['x'] = X_df['bw'].values
y_df['y'] = y_df['bw'].values
X_df['unique_id']='dummy'
y_df['unique_id']='dummy'

# same for test data

@kdgutier?

Training completes, but when I try to predict with y_hat_df = model.predict(X_test_df) I get:

ValueError: You are trying to merge on float64 and datetime64[ns] columns. If you wish to proceed you should use pd.concat

which I think happens because of a dtype incompatibility between time and ds, but that is probably a separate issue.

kdgutier commented 2 years ago

I recommend converting your 'time' column to a datetime using pd.to_datetime; many methods in the NeuralForecast library rely on the 'ds' column being datetime-formatted.

The line X_df['x'] = X_df['bw'].values will cause leakage. The dataset already builds autoregressive features by default if you send only Y_df.
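Concretely, the safe mapping is to put the target only in Y_df and, if your version still requires an X_df, fill it with a static category rather than the target (a minimal sketch; 'fe' is the constant label from your own output):

# the target goes into Y_df only
Y_df = df.rename(columns={'usage': 'y'})[['unique_id', 'ds', 'y']]

# if an X_df is required, keep it free of the target to avoid leakage;
# a constant category per series is enough
X_df = Y_df[['unique_id', 'ds']].copy()
X_df['x'] = 'fe'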

gsamaras commented 2 years ago

@kdgutier apologies for the late response over the weekend. OK, I think we are very close, but now predict crashes on a batch-size assertion (batch_size=32). Here is the situation:

from sklearn.model_selection import train_test_split

y = df.pop('y')
X = df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
y_train = y_train.to_frame()
X_train['unique_id']='dummy'
y_train['unique_id']='dummy'
y_train['ds'] = X_train['ds']

print(X_train)
print(y_train)
# same for test data
...

which outputs:

       x                            ds unique_id
0     fe 2020-05-13 08:45:57.228000000     dummy
1     fe 2020-05-13 08:46:58.343000064     dummy
2     fe 2020-05-13 08:47:59.299000064     dummy
3     fe 2020-05-13 08:49:00.236000000     dummy
4     fe 2020-05-13 08:50:01.188999936     dummy
...   ..                           ...       ...
6887  fe 2020-05-18 06:05:54.928000000     dummy
6888  fe 2020-05-18 06:06:55.985999872     dummy
6889  fe 2020-05-18 06:07:57.731000064     dummy
6890  fe 2020-05-18 06:08:58.804999936     dummy
6891  fe 2020-05-18 06:09:59.864000000     dummy

[6892 rows x 3 columns]
              y unique_id                            ds
0     1575.6520     dummy 2020-05-13 08:45:57.228000000
1     1575.6520     dummy 2020-05-13 08:46:58.343000064
2     1527.7666     dummy 2020-05-13 08:47:59.299000064
3     1527.7666     dummy 2020-05-13 08:49:00.236000000
4     1477.7880     dummy 2020-05-13 08:50:01.188999936
...         ...       ...                           ...
6887  1675.4131     dummy 2020-05-18 06:05:54.928000000
6888  1641.9484     dummy 2020-05-18 06:06:55.985999872
6889  1646.2307     dummy 2020-05-18 06:07:57.731000064
6890  1646.2307     dummy 2020-05-18 06:08:58.804999936
6891  1650.9961     dummy 2020-05-18 06:09:59.864000000

[6892 rows x 3 columns]
       x                            ds unique_id
6892  fe 2020-05-18 06:11:00.937999872     dummy
6893  fe 2020-05-18 06:12:02.014000128     dummy
6894  fe 2020-05-18 06:13:03.060000000     dummy
6895  fe 2020-05-18 06:14:04.118000128     dummy
6896  fe 2020-05-18 06:15:05.411000064     dummy
...   ..                           ...       ...
8610  fe 2020-05-19 11:28:28.334000128     dummy
8611  fe 2020-05-19 11:29:29.504000000     dummy
8612  fe 2020-05-19 11:30:30.544000000     dummy
8613  fe 2020-05-19 11:31:31.724000000     dummy
8614  fe 2020-05-19 11:32:32.780000000     dummy

[1723 rows x 3 columns]
              y unique_id                            ds
6892  1652.7509     dummy 2020-05-18 06:11:00.937999872
6893  1616.0997     dummy 2020-05-18 06:12:02.014000128
6894  1616.0997     dummy 2020-05-18 06:13:03.060000000
6895  1725.8965     dummy 2020-05-18 06:14:04.118000128
6896  1790.9973     dummy 2020-05-18 06:15:05.411000064
...         ...       ...                           ...
8610  1007.4689     dummy 2020-05-19 11:28:28.334000128
8611  1020.8758     dummy 2020-05-19 11:29:29.504000000
8612  1020.8758     dummy 2020-05-19 11:30:30.544000000
8613  1059.2924     dummy 2020-05-19 11:31:31.724000000
8614  1025.6858     dummy 2020-05-19 11:32:32.780000000

[1723 rows x 3 columns]

Then I train exactly as explained in the Medium post:

model = ESRNN(max_epochs=3, freq_of_test=1, batch_size=32, ...)
model.fit(X_train, y_train)

and predict like this:

y_hat = model.predict(X_test)

which crashes:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-58-01b48d2fa9d2> in <module>()
      5 print(y_test.shape)
      6 
----> 7 y_hat = model.predict(X_test)
      8 
      9 # Evaluate predictions

1 frames
/usr/local/lib/python3.7/dist-packages/ESRNN/utils/data.py in update_batch_size(self, new_batch_size)
     84   def update_batch_size(self, new_batch_size):
     85     self.batch_size = new_batch_size
---> 86     assert self.batch_size <= self.n_series
     87     self.n_batches = int(np.ceil(self.n_series / self.batch_size))
     88 

AssertionError:

What am I missing? Test size (1723) is greater than batch size (32).

kdgutier commented 2 years ago

The dataloader takes into account the number of distinct 'unique_id' values, and in your example all of them are 'dummy'.
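That is, the assertion in your traceback compares the batch size with the number of distinct series, roughly:

n_series = Y_df['unique_id'].nunique()   # 1 in your case, since every row is 'dummy'
# the dataloader then asserts batch_size <= n_series,
# hence the AssertionError with batch_size=32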

gsamaras commented 2 years ago

Hmm to be honest I don't really understand how the unique_id applies in my case, where I want to do univariate time series forecasting on y.

How should I solve this, @kdgutier? It's not clear whether I would have to use as many unique_ids as the batch size, and if so, whether they should be balanced across the dataset. I am really lost here.

kdgutier commented 2 years ago

Because the maximum batch_size the dataloader can use here is 1: the ESRNN is a recurrent network, so it needs to visit all the observations of your series sequentially.
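So with a single series the only valid setting is batch_size=1, e.g. (a sketch reusing the constructor call from your message above):

model = ESRNN(max_epochs=3, freq_of_test=1, batch_size=1)  # other hyperparameters as in your original call
model.fit(X_train, y_train)
y_hat = model.predict(X_test)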

gsamaras commented 2 years ago

Does that mean I am constrained to a batch size of 1?

kdgutier commented 2 years ago

If you are using a window-based model like N-BEATS, N-HiTS, or any other MLP-based model, you can use a bigger batch_size. If you are using a pure RNN model, each batch element is an entire series by construction, so with a single series you are limited to a batch size of one, unless you do some special work on the dataloader / series preprocessing.

kdgutier commented 2 years ago

I suggest you move to the NeuralForecast library and try running the N-HiTS example.
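For reference, the N-HiTS workflow in recent versions of NeuralForecast looks roughly like this (a sketch; exact parameter names may differ across versions, so treat the linked Colab example as the authoritative reference):

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

# Y_df has columns ['unique_id', 'ds', 'y'], with 'ds' as datetimes;
# freq='T' assumes roughly one-minute data, as in your sample
nf = NeuralForecast(models=[NHITS(h=24, input_size=96, max_steps=200)], freq='T')
nf.fit(df=Y_df)
y_hat_df = nf.predict()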