Support for panel / longitudinal data

GitHunter0 commented 3 years ago

Hi folks, first of all, congratulations on the amazing project. It is impressive how well and how easily darts works!

I'm just missing support for panel / longitudinal data. It is a very important feature in many scientific fields, such as Economics.

Do you plan to allow panel data in darts analysis?

Thanks a lot

hrzn commented 3 years ago

Hi @GitHunter0 , thanks for your feedback!

I'm not sure I fully understand your question, but darts already supports multivariate time series. Those are multi-dimensional time series, where each time point can "contain" multiple values. For instance, you can create such a TimeSeries by calling TimeSeries.from_dataframe(df), where df contains multiple columns (one per dimension). If you need to represent multiple such TimeSeries, some models can be fit() on a Sequence of TimeSeries (each potentially having multiple dimensions). Does this address your case? If not, I would be curious to know more about what you are trying to do.

GitHunter0 commented 3 years ago

@hrzn , thank you for the reply!

I'm sorry it took me so long to come back here.

Suppose I have panel data like the example below, but instead of just USA and IRELAND and 2 time periods, I have, for example, 200 countries and 1000 time periods. Is there an easy and efficient way to apply a single model to it and predict both 'gross_domestic_product' and 'inflation' by country? NOTE: We could also train a separate model for each country, but that would waste a lot of useful data from the other countries.

import pandas as pd

dates_USA = pd.date_range("2000-01-01", periods=2, freq="MS")
dates_IRELAND = pd.date_range("2000-01-01", periods=2, freq="MS")

df = pd.DataFrame({'date': dates_USA.append(dates_IRELAND) ,
                   'country':['USA','USA', 'IRELAND', 'IRELAND'],
                   'gross_domestic_product': [1000, 2000, 1500, 2500],
                   'inflation': [1.7, 4.3, 5.8, 2.1] })
df

GitHunter0 commented 3 years ago

@hrzn , I also have 4 related questions that I would really appreciate your help with. I can open separate issues if you find that more appropriate.

Consider the following example:

import pandas as pd
import numpy as np
import torch
import matplotlib.pyplot as plt

import darts
from darts import TimeSeries
from darts.utils.timeseries_generation import (gaussian_timeseries, linear_timeseries, sine_timeseries, constant_timeseries)
from darts.models import RNNModel, TCNModel, TransformerModel, NBEATSModel

torch.manual_seed(1); np.random.seed(1)  # for reproducibility

#
sine_series = sine_timeseries(length=50, freq='M')

df = (  sine_series.pd_dataframe() 
        .assign(time_trend = lambda x: range(len(sine_series))) 
        .rename(columns={'0': 'sine'}) )
df

df_darts = darts.timeseries.TimeSeries(df)
df_darts['sine']
df_darts['time_trend']

#
train, val = df_darts.split_after(pd.Timestamp('20030601'))

# 
my_model = RNNModel(
    model='LSTM',
    model_name='RNN_LSTM',
    input_chunk_length=12, 
    output_chunk_length=1,
    hidden_size=25,
    n_rnn_layers=1,
    dropout=0.4,
    batch_size=16,
    n_epochs=100, 
    optimizer_kwargs={'lr': 1e-3},
    log_tensorboard=True,
    random_state=42
)

train_list = [ train['sine'], train['time_trend'] ]

my_model.fit(series=train_list, verbose=True)

The model predicts the sine series very well:

ts_var = 'sine'
pred = my_model.predict(n=len(val)+10, series=train[ts_var])

df_darts[ts_var].plot(label='actual')
pred.plot(label='forecast')
plt.legend();

(1) However, the model predicts the linear time trend series very badly. Why? Can't it handle non-stationary series?

ts_var = 'time_trend'
pred = my_model.predict(n=len(val)+10, series=train[ts_var])
#
df_darts[ts_var].plot(label='actual')
pred.plot(label='forecast')
plt.legend();

(2) How does a darts model 'know' which series ('sine' or 'time_trend' in this case) is the target series for the prediction, since it does not rely on series names? Let me explain: I can pass a new, nameless series that did not exist before and was not used in training, and the model still returns a prediction. So it is assuming "new_series" is "sine" or "time_trend", right? Which one? I don't get how the model discriminates between them.

new_series = constant_timeseries(value = 3, length= 50, freq = 'M')
pred = my_model.predict(n=len(val)+10, series=new_series[0:int(len(train))])

new_series.plot(label='actual')
pred.plot(label='forecast')
plt.legend();

(3) All the models I tried made very inaccurate predictions, not only for linear time trends but also for (nearly) constant series. Is there a particular reason for this result, and is there a way to get around it?

(4) Lastly, increasing n_epochs sometimes makes the predictions worse. PyTorch has a method to retrieve the 'best' model; does darts have this capability too?

pennfranc commented 3 years ago

Hi @GitHunter0, thanks for getting in touch! What you're suggesting in your example with countries and predicting GDP is definitely supported by darts as a form of training a model on multiple time series. This would be a type of meta-learning. You can find more information in this blog post and in this example notebook.
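As a rough illustration of the reshaping step this implies, here is a minimal, dependency-free sketch that splits a long-format panel (one row per date and country, as in the DataFrame above) into one time-ordered multivariate series per country. The helper name split_panel and the dictionary layout are made up for this example and are not part of darts; with pandas the same grouping would typically be a groupby on the country column.

```python
from collections import defaultdict

# Long-format panel: one row per (date, country), mirroring the example DataFrame.
rows = [
    {"date": "2000-01-01", "country": "USA", "gross_domestic_product": 1000, "inflation": 1.7},
    {"date": "2000-02-01", "country": "USA", "gross_domestic_product": 2000, "inflation": 4.3},
    {"date": "2000-01-01", "country": "IRELAND", "gross_domestic_product": 1500, "inflation": 5.8},
    {"date": "2000-02-01", "country": "IRELAND", "gross_domestic_product": 2500, "inflation": 2.1},
]
value_cols = ["gross_domestic_product", "inflation"]


def split_panel(rows, group_col, time_col, value_cols):
    """Hypothetical helper: group long-format rows into one series per entity."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row)

    panel = {}
    for key, group in groups.items():
        # ISO-formatted date strings sort chronologically as plain strings.
        group.sort(key=lambda r: r[time_col])
        panel[key] = {
            "times": [r[time_col] for r in group],
            # One inner list per time step, one value per dimension.
            "values": [[r[c] for c in value_cols] for r in group],
        }
    return panel


panel = split_panel(rows, "country", "date", value_cols)
for country, series in panel.items():
    print(country, series["times"], series["values"])
```

Each entry would then become one multivariate TimeSeries (for example via TimeSeries.from_dataframe on the per-country data), and the resulting list is what gets passed to fit().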

Regarding your additional 4 questions:

(1) It can handle non-stationary series. But it looks like the two time series you use are on vastly different scales, so the model might be having trouble generalizing to both types. Also, this is an example where meta learning doesn't really help, because learning about pure seasonality does not help with trend and vice versa. But of course I understand this is just an example. You should have more success when applying this to GDP, as long as you make sure to scale the data appropriately.

(2) The model's weights are determined by training on all of the training time series, in your case the sine and the trend. In a meta learning framework this would correspond to the 'outer loop' of learning. However, a model's prediction is determined by both the weights and the input to the model. So once the training has finished, the model still needs to be primed on the specific time series it should predict by providing that series as input. This would correspond to the 'inner loop' of meta learning. You can use any series as input here, including the nameless one. But unless your model has been trained on similar data, you shouldn't expect the predictions to be accurate. Strictly speaking, it does not assume the input is either of the two training series. But the weights of the model were learned from these two series, so it should perform reasonably well on similar series.

(3) This might be due to a combination of using different magnitudes and not training on series that are similar to those you provide as inputs, but without further details I can't be sure.

(4) We do indeed support this functionality. You can use the TorchForecastingModel.load_from_checkpoint() function with best=True. See https://github.com/unit8co/darts/blob/master/examples/05-RNN-examples.ipynb for an example.
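To make the scaling advice in (1) and (3) concrete, here is a minimal sketch of per-series min-max scaling in plain Python. It is not the darts Scaler implementation; the function names are hypothetical, and the fallback for constant series (using a span of 1) is an assumption made here to avoid dividing by zero. The two series mimic the 'sine' and 'time_trend' columns from the question.

```python
import math


def fit_minmax(values):
    """Return (minimum, span) for one series; span falls back to 1.0 if constant."""
    lo, hi = min(values), max(values)
    span = hi - lo if hi > lo else 1.0  # assumption: constant series map to all zeros
    return lo, span


def transform(values, params):
    """Map a series into [0, 1] using parameters fitted on that same series."""
    lo, span = params
    return [(v - lo) / span for v in values]


def inverse_transform(values, params):
    """Map scaled values (for example forecasts) back to the original scale."""
    lo, span = params
    return [v * span + lo for v in values]


# Two series on very different scales: roughly [-1, 1] versus [0, 49].
sine = [math.sin(2 * math.pi * 0.1 * t) for t in range(50)]
trend = [float(t) for t in range(50)]

sine_params = fit_minmax(sine)
trend_params = fit_minmax(trend)
sine_scaled = transform(sine, sine_params)
trend_scaled = transform(trend, trend_params)

# After scaling, both series occupy the same range.
print(min(sine_scaled), max(sine_scaled))
print(min(trend_scaled), max(trend_scaled))
```

Keeping one set of parameters per series is what allows forecasts to be mapped back to each series' original units afterwards.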

I hope this helps! If you still have questions don't hesitate to ask. By the way, it looks like you are still using an old version of darts. Be sure to install the latest version to get all the new features such as probabilistic forecasts and filtering models!
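For the point in (2) that a forecast depends on both the learned weights and the input series, a toy sketch may help. It fits a single pair of weights for y[t] = a * y[t-1] + b by least squares, pooled over two training series, and then forecasts from a series the model never saw. This is a deliberately tiny stand-in and not how RNNModel works internally; all names and the synthetic data are assumptions of the example.

```python
def make_pairs(series):
    """Turn one series into (previous value, next value) training pairs."""
    return [(series[i], series[i + 1]) for i in range(len(series) - 1)]


def fit_pooled(series_list):
    """Fit one shared (a, b) for y[t] = a * y[t-1] + b across all series."""
    pairs = [p for s in series_list for p in make_pairs(s)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def predict(weights, series, n):
    """Forecast n steps ahead, starting from the last value of the input series."""
    a, b = weights
    last = series[-1]
    forecast = []
    for _ in range(n):
        last = a * last + b
        forecast.append(last)
    return forecast


def simulate(start, length):
    """Synthetic training data following y[t] = 0.5 * y[t-1] + 1."""
    values = [start]
    for _ in range(length - 1):
        values.append(0.5 * values[-1] + 1.0)
    return values


# The weights are learned once, from all training series together.
weights = fit_pooled([simulate(0.0, 8), simulate(10.0, 8)])

# The same weights give different forecasts depending on the input series.
forecast_low = predict(weights, [0.0], n=2)
forecast_new = predict(weights, [6.0, 4.0], n=2)  # a series not used in training
print(weights, forecast_low, forecast_new)
```

The unseen series still gets a sensible forecast here only because it follows the same dynamics as the training data; an input with different dynamics would get a forecast too, just not an accurate one.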

GitHunter0 commented 3 years ago

Hey @pennfranc , thank you for the excellent answer! It was very helpful.

Below is an MWE of a panel data workflow for two countries (USA and Japan), with 2 target variables + 2 covariates, applying scaling transformations and, at the end, inverse transformations to obtain predictions on the original scale.

I have 2 questions if I may:

(1) In this case, in addition to the covariates, are all target variables also used as predictors of one another? In other words, will unemp and past values of gdp be used to predict gdp, for example?

(2) This workflow would get messy with loops if, instead of 2, we had 100 countries. Is there a better way to implement the same thing?

# - #
import pandas as pd
import numpy as np
import torch
import darts
from darts import TimeSeries, timeseries
from darts.utils.timeseries_generation import (datetime_attribute_timeseries, gaussian_timeseries, linear_timeseries, sine_timeseries, random_walk_timeseries, constant_timeseries)
from darts.models import RNNModel
from darts.dataprocessing.transformers import Scaler 
torch.manual_seed(1); np.random.seed(1)  

# - # Series Generation
n_train = 40
# Series of Inflation 
inflation_usa = random_walk_timeseries(length=n_train, std=0.1)
inflation_usa.plot()
inflation_japan = random_walk_timeseries(length=n_train, std=0.1)
inflation_japan.plot()
# Series of Interest Rate
ir_usa = gaussian_timeseries(length=n_train, std=0.2)
ir_usa.plot()
ir_japan = gaussian_timeseries(length=n_train, std=0.2)
ir_japan.plot()
# Series of Gross Domestic Product
gdp_usa = (
    sine_timeseries(length=n_train, value_frequency=0.1) 
    + linear_timeseries(length=n_train, end_value=5) 
    + random_walk_timeseries(length=n_train, std=0.2)
)          
gdp_usa.plot()
gdp_japan = gdp_usa/5 - inflation_japan
gdp_japan.plot()
# Series of Unemployment rate
unemp_usa = constant_timeseries(length=n_train, value=1) / gdp_usa
unemp_usa.plot()
unemp_japan = (
    2 + unemp_usa/10 + random_walk_timeseries(length=n_train, std=0.3)
)          
unemp_japan.plot()

# - # Targets
targets_usa = gdp_usa.stack(unemp_usa)
targets_japan = gdp_japan.stack(unemp_japan)
targets_list = [targets_usa, targets_japan]

# - # Covariates
covariates_usa = inflation_usa.stack(ir_usa)
covariates_japan = inflation_japan.stack(ir_japan)
covariates_list = [covariates_usa, covariates_japan]

# - # Scale
targets_transformer = Scaler()
targets_list_scaled = targets_transformer.fit_transform(targets_list)
# targets_transformer.inverse_transform(targets_list_scaled)
Scaler().fit_transform(targets_list[1])
#
covariates_transformer = Scaler()
covariates_list_scaled = covariates_transformer.fit_transform(covariates_list)
# covariates_transformer.inverse_transform(covariates_list_scaled)

# - # Fit Model
my_model = RNNModel(
    model='LSTM',
    model_name='LSTM_1',
    input_chunk_length=10,
    output_chunk_length=12,
    n_epochs=400,
    random_state=42
)

my_model.fit(series=targets_list_scaled, 
             covariates=covariates_list_scaled, 
             verbose=True)

# - # Predict
targets_usa_scaled = targets_list_scaled[0]
covariates_usa_scaled = covariates_list_scaled[0]
pred_usa_scaled = my_model.predict(n=12,
                                   series=targets_usa_scaled, 
                                   covariates=covariates_usa_scaled, 
                                   verbose=True)
targets_usa_scaled.plot(label='actual')
pred_usa_scaled.plot(label='forecast')

targets_japan_scaled = targets_list_scaled[1]
covariates_japan_scaled = covariates_list_scaled[1]
pred_japan_scaled = my_model.predict(n=12,
                                     series=targets_japan_scaled, 
                                     covariates=covariates_japan_scaled, 
                                     verbose=True)
targets_japan_scaled.plot(label='actual')
pred_japan_scaled.plot(label='forecast')

# - # Prediction for the original series scales
pred_list_scaled = [pred_usa_scaled, pred_japan_scaled]
pred_list = targets_transformer.inverse_transform(pred_list_scaled)
# USA prediction
targets_list[0].plot(label='actual')
pred_list[0].plot(label='forecast')
# Japan prediction
targets_list[1].plot(label='actual')
pred_list[1].plot(label='forecast')

GitHunter0 commented 3 years ago

Hi @hrzn !

I had built a very nice app using darts, but the new API broke it because RNNModel() no longer has the output_chunk_length parameter. I used to get good predictions by tweaking that parameter, but now I can't get similar performance.

How can I achieve the same effect as setting output_chunk_length=12, for example, with the new API?

hrzn commented 3 years ago

Could you try using BlockRNNModel instead? It is basically the old RNNModel renamed.

GitHunter0 commented 3 years ago

Thank you @hrzn ! That was exactly what I needed.

unit8co / darts

Support for panel / longitudinal data #356