Closed dnth closed 3 years ago
I think you need to pass in splits to the TSDatasets rather than generating three separate datasets.
I'm pretty confused too. I read this tutorial https://github.com/timeseriesAI/timeseriesAI/blob/master/tutorial_nbs/00c_Time_Series_data_preparation.ipynb but can't figure out how to get my pandas dataframe into the right format.
The dataframe just has two columns I care about: Usage_MWh (Y) and unix_timestamp (X).
Example:
data = {"unix_timestamp":{"6":1451606400.0,"1558":1454284800.0,"3110":1456790400.0,"4662":1459468800.0,"6214":1462060800.0,"7766":1464739200.0,"9318":1467331200.0,"10870":1470009600.0,"12422":1472688000.0,"13974":1475280000.0,"15526":1477958400.0,"17078":1480550400.0,"18630":1483228800.0,"20182":1485907200.0,"21734":1488326400.0,"23286":1491004800.0,"24838":1493596800.0,"26390":1496275200.0,"27942":1498867200.0,"29494":1501545600.0,"31046":1504224000.0,"32598":1506816000.0,"34150":1509494400.0,"35702":1512086400.0,"37254":1514764800.0,"38806":1517443200.0,"40358":1519862400.0,"41910":1522540800.0,"43462":1525132800.0,"45014":1527811200.0,"46566":1530403200.0,"48118":1533081600.0,"49670":1535760000.0,"51222":1538352000.0,"52774":1541030400.0,"54326":1543622400.0,"55878":1546300800.0,"57430":1548979200.0,"58982":1551398400.0,"60534":1554076800.0,"62086":1556668800.0,"63638":1559347200.0,"65190":1561939200.0,"66742":1564617600.0,"68294":1567296000.0,"69846":1569888000.0,"71398":1572566400.0,"72950":1575158400.0,"74502":1577836800.0,"76054":1580515200.0,"77606":1583020800.0},"Usage_MWh":{"6":5.34858,"1558":3.78055,"3110":3.4831,"4662":3.74901,"6214":3.02347,"7766":7.63334,"9318":5.62975,"10870":5.51058,"12422":4.36067,"13974":3.29915,"15526":2.76066,"17078":2.94552,"18630":2.7777,"20182":2.76716,"21734":5.78573,"23286":4.8537129444,"24838":3.1271232778,"26390":2.8168842646,"27942":2.8774968882,"29494":2.8774968882,"31046":2.7846744079,"32598":2.8774968882,"34150":2.7846744079,"35702":2.8774968882,"37254":2.8774968882,"38806":2.5990294474,"40358":2.8774968882,"41910":2.5288257571,"43462":2.5724465738,"45014":3.3510199005,"46566":3.060722109,"48118":2.6989527056,"49670":2.5984900474,"51222":3.4421489889,"52774":3.4093083871,"54326":3.5249084516,"55878":3.1468401143,"57430":3.0175142015,"58982":3.3731491579,"60534":3.0313829708,"62086":3.1152347778,"63638":3.2681218106,"65190":3.4173398852,"66742":2.897951582,"68294":3.3056545,"69846":3.1457436,"71398":2.646408469,"72950":2.5245141129,"74502":2.7281552182,"76054":6.5999127071,"77606":7.7869288929}}
df = pd.DataFrame.from_dict(data)
window_length = 5
X, y = SlidingWindow(window_length, get_x=['unix_timestamp'], get_y='Usage_MWh')(df)
itemify(X, y)
tfms = [None, [ToFloat(), ToNumpyTensor()]]
dsets = TSDatasets(X, y, tfms=tfms)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid)
print(dls.vars) # Fails
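For reference, here is a plain-NumPy approximation of what a call like the SlidingWindow one above produces for a univariate forecasting setup. The `sliding_window` helper below is hypothetical (tsai's actual SlidingWindow accepts many more parameters, such as stride and horizon); the point is the output shape, `[n_samples, n_variables, n_steps]`, with the y value following each window as the target.

```python
import numpy as np
import pandas as pd

# Hypothetical standalone sketch of a univariate sliding window:
# each sample is `window_length` consecutive x values, and the target
# is the y value that immediately follows the window.
def sliding_window(df, window_length, x_col, y_col):
    xs, ys = [], []
    for start in range(len(df) - window_length):
        xs.append(df[x_col].to_numpy()[start:start + window_length])
        ys.append(df[y_col].to_numpy()[start + window_length])
    # tsai uses the shape [n_samples, n_variables, n_steps]
    return np.stack(xs)[:, None, :], np.array(ys)

df = pd.DataFrame({"unix_timestamp": np.arange(10, dtype=float),
                   "Usage_MWh": np.arange(10, dtype=float) * 2})
X, y = sliding_window(df, 5, "unix_timestamp", "Usage_MWh")
print(X.shape, y.shape)  # (5, 1, 5) (5,)
```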
Hi @dnth,
thanks for your interest in the tsai library.
In fastai, you usually create a datasets object with train and valid splits only, and then add a test set if necessary.
In tsai it works the same way.
You'd usually do:
dsets = TSDatasets(X, y, tfms=tfms, splits=splits) # splits for train and valid only
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=64) # bs may be an int or a list of ints
and then you can add a test ds:
test_ds = dsets.valid.add_test(X_test, y_test)
test_dl = dls.valid.new(test_ds)
The reason why this is preferable is that sometimes the tfms require a setup on the train set (for example TSStandardize()). In this way, the same tfms applied to valid will be applied to the test set.
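To make that setup step concrete, here is a minimal standalone sketch in plain NumPy (not tsai code) of why tfm statistics must come from the train split: a standardizer analogous to TSStandardize derives its mean/std from the train samples only, then applies those same statistics unchanged to valid and test data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1, 10))
splits = (np.arange(80), np.arange(80, 100))  # train / valid indices

# Statistics computed once, on the train split only (what a tfm's
# "setup" does).
train_mean = X[splits[0]].mean()
train_std = X[splits[0]].std()

def standardize(batch):
    # Always uses the *train* statistics, never the batch's own.
    return (batch - train_mean) / train_std

X_valid_std = standardize(X[splits[1]])
X_test_std = standardize(rng.normal(5.0, 2.0, size=(20, 1, 10)))  # "new" data
```

Because valid and test go through the identical transform, a model never sees data normalized with statistics it didn't train on.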
Hi @Sandyxuxinxi, Your example is correct except that you are not passing any splits to the datasets. If you add for example:
splits = RandomSplitter()(X)
tfms = [None, [ToFloat(), ToNumpyTensor()]]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid)
print(dls.vars) # returns 1
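For anyone without tsai at hand, a rough standalone re-implementation of what RandomSplitter does (this `random_splitter` helper is hypothetical; fastai's version has more options): shuffle the sample indices and cut off a validation fraction, returning a (train_indices, valid_indices) pair like the `splits` used above.

```python
import numpy as np

# Hypothetical equivalent of fastai's RandomSplitter(valid_pct=0.2):
# permute indices, reserve the first valid_pct for validation.
def random_splitter(n, valid_pct=0.2, seed=42):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(n * valid_pct)
    return idx[cut:], idx[:cut]  # (train_idx, valid_idx)

train_idx, valid_idx = random_splitter(46)
print(len(train_idx), len(valid_idx))  # 37 9
```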
I hope it's clearer now.
That's helpful. Thank you for the clarification @oguiza
@oguiza is there a way to load a dataset where the validation data is larger (longer) than the train data? I'm using your SlidingWindow to split my training data into shorter arrays, for example from 2000 points in total to five arrays of 1000, but I want to use the entire array (2000 datapoints) for validation. Is there a way to do that?
p.s. Thanks a lot for releasing the package!
Hi @ranihorev,
Thanks for your comment and question!
If I understand your question correctly, I don't think that makes much sense. With training, what you really want to do is to prepare the model to correctly predict new data that it will receive in the future. But the new data needs to be consistent with the data used in training.
You should use SlidingWindow to prepare all data, and then decide how you split it between train and valid. Data in the future should have the same format.
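A minimal sketch of that workflow, assuming a chronological split (the `chronological_split` helper below is hypothetical; tsai ships its own splitter utilities for this): window the whole series first, then reserve the most recent samples for validation so the valid set mirrors the format of future data.

```python
import numpy as np

# Hypothetical chronological splitter: the first (1 - valid_pct) of the
# windowed samples train the model, the most recent valid_pct validate it.
def chronological_split(n_samples, valid_pct=0.2):
    cut = int(n_samples * (1 - valid_pct))
    return np.arange(cut), np.arange(cut, n_samples)

# e.g. 2000 raw points windowed with window_length=5 -> 1995 samples
train_idx, valid_idx = chronological_split(1995)
print(len(train_idx), len(valid_idx))  # 1596 399
```

Every sample, train or valid, then has the same window length, so anything the model sees at inference time matches the training format.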
I think that I haven’t explained myself well. But I was able to make it work, so thanks anyway!
Sorry I didn’t understand you.
Hi, I have 3 separate pandas dataframes: the train, validation, and test time series. How do I correctly load them into the dataloaders for training? My code is below:
dls = TSDataLoaders.from_dsets(train_dsets, valid_dsets, test_dsets, bs=[128, 128, 128])
Is this the right way?