philipperemy / cond_rnn

Conditional RNNs for Tensorflow / Keras.
MIT License
222 stars · 33 forks

shape of the input tensor when using conditional RNN #12

Closed · reyhanhgh closed this issue 3 years ago

reyhanhgh commented 3 years ago

Thanks for sharing this interesting project. I have been trying to understand how to shape the input tensor using conditional RNN in Keras but I am still very unclear about how to present the input data in the correct shape.

I am working with 10 stations (num_stations = 10). For each station, I have one year (timesteps = 365) of records of three continuous variables: A and B are predictive variables (thus, dim_input = 2) and C is the target variable. For each station, I also have two conditions: a categorical condition with 5 classes (dim_cond1 = 5) and a continuous condition (dim_cond2 = 1).
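As a concrete sketch of these dimensions (hypothetical values, NumPy only, not code from the repository): the categorical condition would typically be one-hot encoded to width dim_cond1, while the continuous condition stays a single column per station:

```python
import numpy as np

num_stations = 10  # stations, as described above
dim_cond1 = 5      # classes of the categorical condition
dim_cond2 = 1      # continuous condition

# Hypothetical class labels (0..4), one per station.
station_classes = np.array([0, 3, 1, 4, 2, 0, 1, 3, 2, 4])

# One-hot encode the categorical condition: shape (num_stations, dim_cond1).
cond1 = np.eye(dim_cond1)[station_classes]

# Continuous condition: shape (num_stations, dim_cond2).
cond2 = np.random.uniform(size=(num_stations, dim_cond2))

print(cond1.shape)  # (10, 5)
print(cond2.shape)  # (10, 1)
```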

What I want is a model trained on information from all ten stations, taking into account the two conditions for every station (I call this the global model).

What I am confused about is the shape of the input tensor that I should feed into the model. I know that for an LSTM model that is trained based on the time series of only one station (I call this model the local model), the shape of the input tensor takes the form [batch_size, timesteps, input_dim]. For the local model, I am able to use a generator that extracts and yields a tuple (samples, targets), where samples (one batch of input data) and targets are from one station.
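For context, the local-model generator described above could look like this (a minimal NumPy sketch with hypothetical names, assuming three years of data for a single station):

```python
import numpy as np

def local_generator(series, targets, batch_size=32, window=365):
    """Yield (samples, targets) batches from a single station's record.

    series:  array of shape (n_days, dim_input) for one station (A and B)
    targets: array of shape (n_days,) -- the target variable C
    """
    starts = np.arange(len(series) - window)
    while True:
        np.random.shuffle(starts)
        for i in range(0, len(starts) - batch_size + 1, batch_size):
            idx = starts[i:i + batch_size]
            x = np.stack([series[s:s + window] for s in idx])  # (batch, window, dim_input)
            y = targets[idx + window]                          # value of C right after each window
            yield x, y

# Three years of dummy data for one station: A and B as inputs, C as target.
series = np.random.rand(3 * 365, 2)
targets = np.random.rand(3 * 365)
x, y = next(local_generator(series, targets))
print(x.shape, y.shape)  # (32, 365, 2) (32,)
```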

But the global model should sample part of the data from each station before completing each epoch.

Appending the time series of the different stations one on top of the other and iterating through all the rows does not make sense in the context of my problem, since the date would jump, for instance, from 2020-12-29 (a winter day) to 1986-07-01 (a summer day).

I have trouble understanding how the batch extraction should pass from one station to another in the global model. I see two possible solutions:

1- Use a generator similar to that of the local model: create a training loop over stations, train on the data from each station one after another, and update the weights but reset the state to differentiate between the time series.

2- Otherwise, is there a way to build a generator that could somehow yield a batch drawn from all stations?
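For option 2, a generator along these lines could work (a NumPy sketch with hypothetical names, not the repository's actual example; with cond_rnn the condition arrays would be passed as additional inputs alongside the window batch):

```python
import numpy as np

def global_generator(station_series, station_targets, cond1, cond2,
                     batch_size=32, window=30):
    """Yield batches whose windows are drawn from randomly chosen stations,
    so every batch (and hence every epoch) mixes data from all stations."""
    num_stations = len(station_series)
    while True:
        stations = np.random.randint(num_stations, size=batch_size)
        x, c1, c2, y = [], [], [], []
        for s in stations:
            series = station_series[s]
            start = np.random.randint(len(series) - window)
            x.append(series[start:start + window])   # window of A and B
            c1.append(cond1[s])                      # categorical condition of station s
            c2.append(cond2[s])                      # continuous condition of station s
            y.append(station_targets[s][start + window])  # next value of C
        yield (np.stack(x), np.stack(c1), np.stack(c2)), np.array(y)

# Dummy data: 10 stations with different record lengths.
station_series = [np.random.rand(np.random.randint(200, 365), 2) for _ in range(10)]
station_targets = [np.random.rand(len(s)) for s in station_series]
cond1 = np.eye(5)[np.random.randint(5, size=10)]  # one-hot categorical condition
cond2 = np.random.rand(10, 1)                     # continuous condition

(x, c1, c2), y = next(global_generator(station_series, station_targets, cond1, cond2))
print(x.shape, c1.shape, c2.shape, y.shape)  # (32, 30, 2) (32, 5) (32, 1) (32,)
```

Sampling stations at random per batch keeps windows contiguous within a station, so no artificial date jumps are created.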

Thanks for your thoughts.

philipperemy commented 3 years ago

@reyhanhgh I have added an example for your case: https://github.com/philipperemy/cond_rnn/blob/master/examples/dummy_stations_example.py.

Let me know if it solves your problem.

reyhanhgh commented 3 years ago

@philipperemy Hi Philippe and thank you very much for the very complete example that you provided. It is much clearer now. Just a small question: I have come across some stations with NA values. So, if I want to use the full record period for these stations, the lengths of the time series will not be the same. For instance, I have one station with 200 days of data (in my real data the time series are far longer), one with 270, another with 300, and so on. In such a case, should I follow the temperature example that you recently added?

Many thanks for your follow-up!
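One common way to handle records of unequal length (a generic sketch with the made-up lengths from the question, not necessarily what the temperature example does) is to pad the shorter series to a common length and let a Keras Masking layer skip the padded steps:

```python
import numpy as np

# Hypothetical per-station record lengths, as in the question.
lengths = [200, 270, 300]
dim_input = 2
series = [np.random.rand(n, dim_input) for n in lengths]

max_len = max(lengths)
# Pre-pad each series with zeros so all share shape (max_len, dim_input);
# a Keras Masking layer (mask_value=0.) can then ignore the padded steps.
padded = np.stack([np.pad(s, ((max_len - len(s), 0), (0, 0))) for s in series])
print(padded.shape)  # (3, 300, 2)
```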

philipperemy commented 3 years ago

@reyhanhgh yeah, you can follow the example that I recently added. Or you can interpolate the missing values; the best is to backfill them. If you use pandas, it's a one-liner (look for bfill).

reyhanhgh commented 3 years ago

@philipperemy Thanks Philippe! I have a concern about imputing the missing data: if I use some algorithm to fill the NAs and then use DL to predict them, isn't that data leakage?

philipperemy commented 3 years ago

If you carry forward the previous values, then no. I think that is called a forward fill. Use this if you want to be rigorous and have no leakage ;)
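A minimal pandas illustration of the difference on a toy series (`ffill` uses only past values, so nothing from the future leaks; `bfill` copies future values backwards):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each NA takes the last PAST value -> no future information leaks.
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]

# Backfill: each NA takes the next FUTURE value -> uses information from the future.
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```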

reyhanhgh commented 3 years ago

@philipperemy: Thank you very much Philippe! I think I finally have to go with the approach you gave for time series of different lengths, since the physics of the target variable (which contains all the NAs) is tricky: the target is the discharge of basins. So imagine that between the NA days we have, for instance, a flood (a target value close to the target's maximum), while several days before and after (non-NA days) we observed a discharge of 0. I don't know how I could reproduce such events using imputation algorithms ...

I'll try the approach in the other example and come back to you. Thank you again for your help!

philipperemy commented 3 years ago

I'll close this issue for now. Feel free to comment on how it goes.