Why
To get a more sensible output when using infer_mode to predict values further into the future than the data source passed to .predict() covers.
How
Reformat output timestamps so that they correspond directly with the forecasted quantity.
Add confidence (and bounds if applicable) to all n predicted data points (n is defined at training time with the nr_predictions argument). It's important to note that the statistical guarantees offered by the conformal prediction framework do not hold for these additional data points. Extending the guarantees to them is an open research question.
Example
Current behavior
To illustrate the changes, assume a predictor with nr_predictions=4 and window=3, and that the latest timestamp in the when_data dataframe is 1342051200 with 7862400-wide intervals (that is, quarterly frequency).
The predict() output previous to this PR:
Note that the output contains nr_predictions=4 forecasted points for each group/series (as determined by the group_by: Country column), and that the latest timestamp in each row (1349913600=October 2012) corresponds to the quarter that follows the latest point in the input data frame (1342051200=July 2012).
However, this format is confusing because the timestamps and the forecasts do not line up: the predictions in the Traffic key start from 1349913600 and assume regularly spaced intervals moving forward, but crucially, MDB Native does not return those timestamps.
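The timestamps that the old output leaves implicit can be reconstructed from the numbers in this example. The snippet below is only a sketch of that arithmetic: the helper name forecast_timestamps is hypothetical and is not part of MDB Native, and it assumes strictly regular intervals.

```python
from datetime import datetime, timezone
from typing import List


def forecast_timestamps(latest_ts: int, interval: int, nr_predictions: int) -> List[int]:
    """Epoch timestamps of the next nr_predictions regularly spaced points.

    Hypothetical helper for illustration only; assumes a fixed interval width.
    """
    return [latest_ts + interval * step for step in range(1, nr_predictions + 1)]


# Values taken from the example: latest input timestamp and quarterly interval.
stamps = forecast_timestamps(1342051200, 7862400, 4)
for ts in stamps:
    # First line printed is 1349913600 (October 2012), last is 1373500800 (July 2013).
    print(ts, datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%B %Y"))
```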
Another undesirable fact: confidence and confidence ranges are available only for the first forecast in each row.
New behavior
The output with the changes introduced in this PR:
We see that now the timestamps do match the predicted quantity, starting from 1349913600=October 2012 all the way to 1373500800=July 2013, forecasting a full year as specified in learn.
Additionally, both confidence and confidence ranges have been turned into arrays that have this information for all 4 predictions, though guarantees made by the ICP framework do not hold for any predictions other than the very first one, for each row.
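For consumers of the new format, one way to read a row is to line up each timestamp with its forecast and confidence, and to mark only the first step as covered by the ICP guarantee. This is a rough sketch under assumptions: the function name, the output field names, and the numeric values in the usage lines are all made up for illustration and are not real model output or the actual MDB Native schema.

```python
from typing import Dict, List, Sequence


def per_step_view(
    timestamps: Sequence[int],
    predictions: Sequence[float],
    confidences: Sequence[float],
) -> List[Dict]:
    """Pair each forecast with its timestamp and confidence.

    Hypothetical helper: icp_guarantee is True only for the first forecast,
    since the conformal guarantees do not cover the later steps.
    """
    if not (len(timestamps) == len(predictions) == len(confidences)):
        raise ValueError("expected one timestamp and one confidence per forecast")
    return [
        {
            "timestamp": ts,
            "prediction": pred,
            "confidence": conf,
            "icp_guarantee": step == 0,
        }
        for step, (ts, pred, conf) in enumerate(zip(timestamps, predictions, confidences))
    ]


# Timestamps come from the example above; predictions and confidences are
# placeholder numbers chosen only to show the shape of the result.
steps = per_step_view(
    [1349913600, 1357776000, 1365638400, 1373500800],
    [100.0, 110.0, 120.0, 130.0],
    [0.9, 0.9, 0.9, 0.9],
)
for step in steps:
    print(step)
```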
Lastly, infer mode assumes there is no true data available against which to compare the forecasts, so the {target}_anomaly and __observed_{target} keys operate on the assumption that the last seen value will hold and apply the criterion only to the first forecast. That is why those keys still hold lists of values rather than nested lists. This is important for stream integrations to work.