timeseriesAI / tsai

State-of-the-art Deep Learning library for Time Series and Sequences in Pytorch / fastai
https://timeseriesai.github.io/tsai/
Apache License 2.0

Loading train, validation and test data correctly #19

Closed. dnth closed this issue 3 years ago

dnth commented 3 years ago

Hi, I have three separate pandas dataframes holding the train, validation and test time series. How do I correctly load them into the dataloaders for training? My code is below:

X_train, y_train = SlidingWindow(window_length, get_x=columns[:-1], get_y='Ah')(train_df)
X_valid, y_valid = SlidingWindow(window_length, get_x=columns[:-1], get_y='Ah')(valid_df)
X_test, y_test = SlidingWindow(window_length, get_x=columns[:-1], get_y='Ah')(test_df)
train_dsets = TSDatasets(X_train, y_train, tfms=tfms)
valid_dsets = TSDatasets(X_valid, y_valid, tfms=tfms)
test_dsets = TSDatasets(X_test, y_test, tfms=tfms)

dls = TSDataLoaders.from_dsets(train_dsets, valid_dsets, test_dsets, bs=[128, 128, 128])

Is this the right way?

Sandyxuxinxi commented 3 years ago

I think you need to pass in splits to the TSDatasets rather than generating three separate datasets.

I'm pretty confused too. I read this tutorial https://github.com/timeseriesAI/timeseriesAI/blob/master/tutorial_nbs/00c_Time_Series_data_preparation.ipynb but can't figure out how to get my pandas dataframe into the right format.

The dataframe has just two columns I care about: Usage_MWh (Y) and unix_timestamp (X).

Example:

    import pandas as pd
    from tsai.all import *

    data = {"unix_timestamp":{"6":1451606400.0,"1558":1454284800.0,"3110":1456790400.0,"4662":1459468800.0,"6214":1462060800.0,"7766":1464739200.0,"9318":1467331200.0,"10870":1470009600.0,"12422":1472688000.0,"13974":1475280000.0,"15526":1477958400.0,"17078":1480550400.0,"18630":1483228800.0,"20182":1485907200.0,"21734":1488326400.0,"23286":1491004800.0,"24838":1493596800.0,"26390":1496275200.0,"27942":1498867200.0,"29494":1501545600.0,"31046":1504224000.0,"32598":1506816000.0,"34150":1509494400.0,"35702":1512086400.0,"37254":1514764800.0,"38806":1517443200.0,"40358":1519862400.0,"41910":1522540800.0,"43462":1525132800.0,"45014":1527811200.0,"46566":1530403200.0,"48118":1533081600.0,"49670":1535760000.0,"51222":1538352000.0,"52774":1541030400.0,"54326":1543622400.0,"55878":1546300800.0,"57430":1548979200.0,"58982":1551398400.0,"60534":1554076800.0,"62086":1556668800.0,"63638":1559347200.0,"65190":1561939200.0,"66742":1564617600.0,"68294":1567296000.0,"69846":1569888000.0,"71398":1572566400.0,"72950":1575158400.0,"74502":1577836800.0,"76054":1580515200.0,"77606":1583020800.0},"Usage_MWh":{"6":5.34858,"1558":3.78055,"3110":3.4831,"4662":3.74901,"6214":3.02347,"7766":7.63334,"9318":5.62975,"10870":5.51058,"12422":4.36067,"13974":3.29915,"15526":2.76066,"17078":2.94552,"18630":2.7777,"20182":2.76716,"21734":5.78573,"23286":4.8537129444,"24838":3.1271232778,"26390":2.8168842646,"27942":2.8774968882,"29494":2.8774968882,"31046":2.7846744079,"32598":2.8774968882,"34150":2.7846744079,"35702":2.8774968882,"37254":2.8774968882,"38806":2.5990294474,"40358":2.8774968882,"41910":2.5288257571,"43462":2.5724465738,"45014":3.3510199005,"46566":3.060722109,"48118":2.6989527056,"49670":2.5984900474,"51222":3.4421489889,"52774":3.4093083871,"54326":3.5249084516,"55878":3.1468401143,"57430":3.0175142015,"58982":3.3731491579,"60534":3.0313829708,"62086":3.1152347778,"63638":3.2681218106,"65190":3.4173398852,"66742":2.897951582,"68294":3.3056545,"69846":3.1457436,"71398":2.646408469,"72950":2.5245141129,"74502":2.7281552182,"76054":6.5999127071,"77606":7.7869288929}}
    df = pd.DataFrame.from_dict(data)

    window_length = 5
    X, y = SlidingWindow(window_length, get_x=['unix_timestamp'], get_y='Usage_MWh')(df)
    itemify(X, y)

    tfms  = [None, [ToFloat(), ToNumpyTensor()]]
    dsets = TSDatasets(X, y, tfms=tfms)
    dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid)
    print(dls.vars) # Fails

oguiza commented 3 years ago

Hi @dnth, thanks for your interest in the tsai library. In fastai, you usually create a datasets object with train and valid splits only, and then add a test set if necessary. In tsai it works the same way. You'd usually do:

dsets = TSDatasets(X, y, tfms=tfms, splits=splits) # splits for train and valid only
dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=64) # bs may be an int or a list of ints

and then you can add a test ds:

test_ds = dsets.valid.add_test(X_test, y_test)
test_dl = dls.valid.new(test_ds)

This is preferable because some tfms require a setup step on the train set (for example, TSStandardize()). Done this way, the same tfms that are applied to the valid set are also applied to the test set.
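As a rough illustration of the point above, the sketch below uses plain NumPy rather than tsai. It stacks separately prepared train and valid arrays into one array, builds a `splits` pair of index lists, and fits standardization statistics on the train rows only before reusing them on valid and test. The array shapes, the random data and the `standardize` helper are assumptions made for this example; it is not how `TSStandardize()` is implemented, and passing a tuple of plain lists as `splits` to `TSDatasets` is assumed rather than confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical windowed arrays, shaped [samples, variables, window_length].
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3, 10))
X_valid = rng.normal(loc=5.0, scale=2.0, size=(20, 3, 10))
X_test = rng.normal(loc=5.0, scale=2.0, size=(30, 3, 10))

# Stack train and valid into one array and record which rows belong to which.
X = np.concatenate([X_train, X_valid])
train_idx = list(range(len(X_train)))
valid_idx = list(range(len(X_train), len(X)))
splits = (train_idx, valid_idx)  # the kind of object passed as splits=...

# Fit the statistics on the train rows only: one mean/std per variable.
mean = X[train_idx].mean(axis=(0, 2), keepdims=True)
std = X[train_idx].std(axis=(0, 2), keepdims=True)


def standardize(arr):
    """Apply the train-fitted statistics to any split (hypothetical helper)."""
    return (arr - mean) / std


X_train_std = standardize(X[train_idx])
X_valid_std = standardize(X[valid_idx])
X_test_std = standardize(X_test)  # test data never influences mean/std
```

Because the test set is transformed with statistics fitted on the train rows, it is treated exactly like the valid set, which is the behaviour the `add_test` route is meant to give.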

oguiza commented 3 years ago

Hi @Sandyxuxinxi, your example is correct except that you are not passing any splits to the datasets. If you add, for example:

splits = RandomSplitter()(X) 
tfms  = [None, [ToFloat(), ToNumpyTensor()]]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
dls   = TSDataLoaders.from_dsets(dsets.train, dsets.valid)
print(dls.vars) # returns 1

I hope it's clearer now.
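For readers unsure what `splits` contains, it can be thought of as a pair of disjoint index collections, one for train and one for valid. The NumPy sketch below imitates a random split with a 20% validation share, matching the default `valid_pct` of fastai's `RandomSplitter`. The `random_splits` function, its seed and the sample count of 50 are hypothetical and only serve the illustration.

```python
import numpy as np


def random_splits(n_samples, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) as two disjoint lists of row indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_valid = int(n_samples * valid_pct)
    valid_idx = sorted(idx[:n_valid].tolist())
    train_idx = sorted(idx[n_valid:].tolist())
    return train_idx, valid_idx


# Hypothetical number of windows produced by the sliding-window step.
train_idx, valid_idx = random_splits(50)
print(len(train_idx), len(valid_idx))
```

Every sample index lands in exactly one of the two lists, so `dsets.train` and `dsets.valid` never share a window.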

dnth commented 3 years ago

That's helpful. Thank you for the clarification @oguiza

ranihorev commented 3 years ago

@oguiza is there a way to load a dataset where the validation data is larger (longer) than the train data? I'm using your SlidingWindow to split my training data into shorter arrays, for example from 2000 datapoints in total to five arrays of 1000, but I want to use the entire array (2000 datapoints) for validation. Is there a way to do that?

p.s. Thanks a lot for releasing the package!

oguiza commented 3 years ago

Hi @ranihorev, Thanks for your comment and question! If I understand your question correctly, I don't think that makes much sense. With training, what you really want to do is to prepare the model to correctly predict new data that it will receive in the future. But the new data needs to be consistent with the data used in training. You should use SlidingWindow to prepare all data, and then decide how you split it between train and valid. Data in the future should have the same format.
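The suggestion to window all the data first and split afterwards can be sketched with NumPy alone. This is not tsai's `SlidingWindow`; it uses `numpy.lib.stride_tricks.sliding_window_view` (NumPy 1.20 or later) on a made-up series of 2000 points, and the stride of 250 and the 4/1 chronological split are assumptions chosen so that five windows of length 1000 come out, as in the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical univariate series with 2000 datapoints.
series = np.arange(2000, dtype=float)
window_length = 1000
stride = 250

# All windows of length 1000, then keep every 250th start position.
windows = sliding_window_view(series, window_length)[::stride]

# Add a variables axis: [samples, variables, window_length].
X = windows[:, None, :]

# Split only after windowing, so train and valid samples share one format.
train_idx = list(range(4))     # the four earliest windows
valid_idx = list(range(4, 5))  # the latest window
splits = (train_idx, valid_idx)

print(X.shape)
```

Note that with a stride smaller than the window length the windows overlap, so train and valid samples share datapoints; whether that is acceptable depends on the task.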

ranihorev commented 3 years ago

I think that I haven’t explained myself well. But I was able to make it work, so thanks anyway!

On Tue, Dec 29, 2020 at 04:24 Ignacio Oguiza notifications@github.com wrote:

Hi @ranihorev https://github.com/ranihorev, Thanks for your comment and question! If I understand your question correctly, I don't think that makes much sense. With training, what you really want to do is to prepare the model to correctly predict new data that it will receive in the future. But the new data needs to be consistent with the data used in training. You should use SlidingWindow to prepare all data, and then decide how you split it between train and valid. Data in the future should have the same format.

oguiza commented 3 years ago

Sorry I didn’t understand you.