mindsdb / mindsdb_native

Machine Learning in one line of code
http://mindsdb.com
GNU General Public License v3.0

Improved time series predictions for future timestamps #523

Closed paxcema closed 3 years ago

paxcema commented 3 years ago

Why

To produce a more sensible output when using infer_mode to predict values further into the future than the data passed to .predict() covers.

How

Example

Current behavior

To illustrate the changes, assume a predictor with nr_predictions=4 and window=3, and that the latest timestamp in the when_data dataframe is 1342051200, with 7862400-second intervals (91 days, that is, quarterly frequency).

The predict() output prior to this PR:

{'T': [[1326326400.0, 1334188800.0, 1342051200.0, 1349913600.0],
  [1326326400.0, 1334188800.0, 1342051200.0, 1349913600.0],
  [1326326400.0, 1334188800.0, 1342051200.0, 1349913600.0],
  [1326326400.0, 1334188800.0, 1342051200.0, 1349913600.0]],
 'Country': ['Japan', 'NZ', 'UK', 'US'],
 'Traffic': [[115332, 104071, 71530, 94848],
  [324261, 259259, 270811, 325602],
  [174730, 195126, 97320, 105799],
  [117756, 136044, 104274, 110466]],
 '__observed_Traffic': [101900, 319840, 101690, 106540],
 'Traffic_confidence': [0.08, 0.08, 0.08, 0.08],
 'Traffic_confidence_range': [[86393.34343434343, 144270.65656565657],
  [295322.34343434346, 353199.65656565654],
  [145791.34343434343, 203668.65656565657],
  [88817.34343434343, 146694.65656565657]],
 'Traffic_anomaly': [False, False, True, False]}

Note that the output contains nr_predictions=4 forecasted points for each group/series (as determined by the group_by: Country column), and that the latest timestamp in each row (1349913600=October 2012) corresponds to the quarter that follows the latest point in the input data frame (1342051200=July 2012).

However, this format is confusing because the timestamps and the forecasts do not line up. The predictions in the Traffic key start at 1349913600 and assume regularly spaced intervals moving forward, but the T key lists the three timestamps of the input window followed by only the first forecast timestamp. Crucially, MDB Native does not return the timestamps of the remaining forecasts.

Another undesirable fact: confidence and confidence ranges are available only for the first forecast in each row.

New behavior

The output with the changes introduced in this PR:

{'T': [[1349913600.0, 1357776000.0, 1365638400.0, 1373500800.0],
  [1349913600.0, 1357776000.0, 1365638400.0, 1373500800.0],
  [1349913600.0, 1357776000.0, 1365638400.0, 1373500800.0],
  [1349913600.0, 1357776000.0, 1365638400.0, 1373500800.0]],
 'Country': ['Japan', 'NZ', 'UK', 'US'],
 'Traffic': [[115332, 104071, 71530, 94848],
  [324261, 259259, 270811, 325602],
  [174730, 195126, 97320, 105799],
  [117756, 136044, 104274, 110466]],
 '__observed_Traffic': [101900, 319840, 101690, 106540],
 'Traffic_confidence': [[0.08, 0.08, 0.08, 0.08],
  [0.08, 0.08, 0.08, 0.08],
  [0.08, 0.08, 0.08, 0.08],
  [0.08, 0.08, 0.08, 0.08]],
 'Traffic_confidence_range': [[[86393.34343434343, 144270.65656565657],
   [75132.34343434343, 133009.65656565657],
   [42591.343434343435, 100468.65656565657],
   [65909.34343434343, 123786.65656565657]],
  [[295322.34343434346, 353199.65656565654],
   [230320.34343434346, 288197.65656565654],
   [241872.34343434346, 299749.65656565654],
   [296663.34343434346, 354540.65656565654]],
  [[145791.34343434343, 203668.65656565657],
   [166187.34343434343, 224064.65656565657],
   [68381.34343434343, 126258.65656565657],
   [76860.34343434343, 134737.65656565657]],
  [[88817.34343434343, 146694.65656565657],
   [107105.34343434343, 164982.65656565657],
   [75335.34343434343, 133212.65656565657],
   [81527.34343434343, 139404.65656565657]]],
 'Traffic_anomaly': [False, False, True, False]}

The timestamps now match the predicted values, running from 1349913600 (October 2012) through 1373500800 (July 2013), so the forecast covers a full year, as specified in learn.
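The alignment can be reproduced with a few lines of arithmetic. This is only a sketch, not MindsDB Native's implementation: `future_timestamps` is a hypothetical helper that assumes a constant interval between observations, and the inputs are the values from the example above.

```python
from datetime import datetime, timezone


def future_timestamps(last_ts, interval, nr_predictions):
    """Return the timestamps of the next `nr_predictions` regularly spaced points.

    Hypothetical helper for illustration; assumes a constant interval in seconds.
    """
    return [float(last_ts + step * interval) for step in range(1, nr_predictions + 1)]


# Values from the example: latest input timestamp and quarterly spacing.
forecast_ts = future_timestamps(last_ts=1342051200, interval=7862400, nr_predictions=4)
print(forecast_ts)

# Convert to calendar months (UTC) to check the forecast horizon.
months = [
    datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
    for ts in forecast_ts
]
print(months)
```

The resulting list matches each row of the T key in the new output, and the months run from 2012-10 to 2013-07.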

Additionally, both confidence and confidence ranges are now nested arrays that cover all 4 predictions in each row. Note, however, that the guarantees made by the ICP framework hold only for the first prediction in each row.
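One way to consume the nested arrays is to flatten them into one record per group and forecast step. The sketch below uses the Japan and NZ rows from the output above. `flatten_forecasts` and the `guaranteed` field are hypothetical names introduced for illustration; the field simply marks the first step of each row, the only one covered by the ICP guarantee.

```python
def flatten_forecasts(output, target, order_by, group_by):
    """Turn the nested predict() output into one record per (group, step).

    Hypothetical helper; assumes the key layout shown in the new output.
    """
    records = []
    for row, group in enumerate(output[group_by]):
        steps = zip(
            output[order_by][row],
            output[target][row],
            output[f"{target}_confidence"][row],
            output[f"{target}_confidence_range"][row],
        )
        for step, (ts, value, confidence, (lower, upper)) in enumerate(steps):
            records.append({
                "group": group,
                "timestamp": ts,
                "forecast": value,
                "confidence": confidence,
                "lower": lower,
                "upper": upper,
                # Only the first step of each row carries the ICP guarantee.
                "guaranteed": step == 0,
            })
    return records


# Subset of the new output (Japan and NZ rows only).
output = {
    "T": [[1349913600.0, 1357776000.0, 1365638400.0, 1373500800.0]] * 2,
    "Country": ["Japan", "NZ"],
    "Traffic": [[115332, 104071, 71530, 94848],
                [324261, 259259, 270811, 325602]],
    "Traffic_confidence": [[0.08, 0.08, 0.08, 0.08]] * 2,
    "Traffic_confidence_range": [
        [[86393.34343434343, 144270.65656565657],
         [75132.34343434343, 133009.65656565657],
         [42591.343434343435, 100468.65656565657],
         [65909.34343434343, 123786.65656565657]],
        [[295322.34343434346, 353199.65656565654],
         [230320.34343434346, 288197.65656565654],
         [241872.34343434346, 299749.65656565654],
         [296663.34343434346, 354540.65656565654]],
    ],
}

records = flatten_forecasts(output, target="Traffic", order_by="T", group_by="Country")
for record in records:
    print(record)
```

A flat layout like this is convenient for loading into a dataframe or plotting, since every forecast carries its own timestamp and bounds.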

Lastly, infer mode assumes that no true data is available to compare against the forecasts, so the {target}_anomaly and __observed_{target} keys operate on the assumption that the last seen value will hold, and apply the criterion only to the first forecast. That is why those keys still contain flat lists of values rather than nested lists. This is important for stream integrations to work.
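The exact anomaly criterion is not spelled out here. As one plausible reading, the sketch below flags a row when the observed value falls outside the confidence range of the first forecast. This is an assumption rather than the confirmed implementation, although it does reproduce the Traffic_anomaly values shown above. `first_step_anomalies` is a hypothetical helper.

```python
def first_step_anomalies(observed, confidence_ranges):
    """Flag rows whose observed value lies outside the first forecast's range.

    Hypothetical helper; the criterion is an assumption for illustration.
    """
    flags = []
    for value, ranges in zip(observed, confidence_ranges):
        lower, upper = ranges[0]  # only the first forecast step is checked
        flags.append(not (lower <= value <= upper))
    return flags


# Values from the new output; only the first range of each row is listed here.
observed = [101900, 319840, 101690, 106540]
confidence_ranges = [
    [[86393.34343434343, 144270.65656565657]],   # Japan
    [[295322.34343434346, 353199.65656565654]],  # NZ
    [[145791.34343434343, 203668.65656565657]],  # UK
    [[88817.34343434343, 146694.65656565657]],   # US
]

anomalies = first_step_anomalies(observed, confidence_ranges)
print(anomalies)
```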