tidyverts / fable

Tidy time series forecasting
https://fable.tidyverts.org
GNU General Public License v3.0

Interpolation of irregular time series #256

Open jens-daniel-mueller opened 4 years ago

jens-daniel-mueller commented 4 years ago

This issue follows up on a conversation with Rob Hyndman that started on Stack Overflow.

https://stackoverflow.com/questions/61078446/interpolation-of-irregular-time-series-with-r

I'm looking for a way to interpolate irregular time series data where the timestamp is POSIXct (rather than a date).

Rob proposed the following solution, which does not seem to work with the example df I created.

library(tidyverse)
library(tsibble)
library(fable)

df <- tibble(date = as.POSIXct(c("2000-01-01 00:00", "2000-01-02 01:00", "2000-01-05 00:00")),
             value = c(1,NA,2)) %>%
  as_tsibble(index = date) %>%
  fill_gaps()

df %>%
  model(naive = ARIMA(value ~ -1 + pdq(0,1,0) + PDQ(0,0,0))) %>%
  interpolate(df)

# Error in UseMethod("interpolate") : 
#  no applicable method for 'interpolate' applied to an object of class "null_mdl"
# In addition: Warning messages:
# 1: It looks like you're trying to fully specify your ARIMA model but have not said if a constant should be included.
# You can include a constant using `ARIMA(y~1)` to the formula or exclude it by adding `ARIMA(y~0)`. 
# 2: 1 error encountered for naive
# [1] Could not find an appropriate ARIMA model.
# This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
# For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots

Thanks for taking a look again!

mitchelloharawild commented 4 years ago

With only two observations, I think there are some issues with computing the variance for the model.

jens-daniel-mueller commented 4 years ago

When as.Date() is used to create the date vector, the fill_gaps() function expands the number of rows from 3 to 5 (daily grid). In this case the interpolation works with only two observations.

When as.POSIXct() is used to create the date vector, the fill_gaps() function expands the number of rows from 3 to 97 (hourly grid). In this case the interpolation fails as outlined in the initial comment.
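
The jump from 5 to 97 rows is consistent with fill_gaps() building a regular grid whose step is the greatest common divisor of the observed gaps. That is an assumption about how tsibble detects the interval; the thread itself only reports the row counts. A minimal base R sketch of the arithmetic, with the time zone fixed to UTC (also an assumption) so that daylight saving cannot change the count:

```r
# Observed time stamps from the example above (time zone fixed to UTC here).
stamps <- as.POSIXct(c("2000-01-01 00:00", "2000-01-02 01:00", "2000-01-05 00:00"),
                     tz = "UTC")

# Gaps between consecutive observations in hours: 25 and 71.
gaps_h <- diff(as.numeric(stamps)) / 3600

# Greatest common divisor of the gaps (Euclid's algorithm).
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)
step_h <- Reduce(gcd, gaps_h)

# A regular grid with that step, spanning the observed range (step given in seconds).
grid_posix <- seq(min(stamps), max(stamps), by = step_h * 3600)

# With as.Date() the time of day is dropped, so the gaps become 1 and 3 days.
days <- as.Date(stamps)
grid_date <- seq(min(days), max(days), by = "day")

length(grid_posix)  # 97
length(grid_date)   # 5
```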

This leads me to suspect that it is not the variance of the model that causes the problem. However, this is just a guess.

In addition, I'm skeptical about the fill_gaps() approach, because it will probably create very large NA gaps when interpolating time series that cover several years with one observation every few days, but whose time stamps are still recorded at a resolution of seconds. Is a direct interpolation to the desired time stamp possible?

mitchelloharawild commented 4 years ago

I still suspect it is the variance for this particular case, but I'll need to look into it more. The model returned from stats::arima() has NaN variance, likely due to the small number of observed values.

As for your second question, you can definitely do direct interpolation of specific time stamps. However, it depends on the model that you are using. The ARIMA() model requires equal spacing between observations, so to interpolate something between two times you'll need to construct equally spaced intermediate values, as is done with fill_gaps(). As an example (and, I think, the only model that supports it so far), TSLM() supports arbitrary spacing between observations. So if you use TSLM() you can specify arbitrary time stamps to interpolate.
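
For plain linear interpolation at an arbitrary time stamp, one option outside fable is base R's approx(), which works on the numeric form of the time stamps and needs no regular grid. Note that this is a swapped-in technique, not the TSLM() route described above. The `target` value and the UTC time zone are assumptions made for illustration.

```r
# The two observed points from the original example (UTC assumed).
obs_time  <- as.POSIXct(c("2000-01-01 00:00", "2000-01-05 00:00"), tz = "UTC")
obs_value <- c(1, 2)

# Time stamp at which a value is wanted (hypothetical target).
target <- as.POSIXct("2000-01-02 01:00", tz = "UTC")

# approx() interpolates linearly on seconds since the epoch.
interp <- approx(x = as.numeric(obs_time), y = obs_value,
                 xout = as.numeric(target))$y

interp  # 1 + 25/96, about 1.26
```

The target lies 25 hours into a 96-hour span, so the interpolated value is 1 + 25/96, and no intermediate NA rows are created.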

jake-mason commented 4 years ago

@mitchelloharawild, what would the call to TSLM look like if you wanted to do a linear interpolation between those points? The ARIMA approach outlined above works well in certain instances, but not in the generic case described below, where one entity (key == 'A') has missing values and the other (key == 'B') consists entirely of three consecutive months of complete data:

library(tidyverse)
library(tsibble)
library(fable)

df <- data.frame(
  key = c(rep('A', 3), rep('B', 3)),
  date = yearmonth(as.Date(c('2019-01-01', '2019-02-01', '2019-04-01', '2019-01-01', '2019-02-01', '2019-03-01'))),
  value = c(5, 7, 1, 25, 26, 28)
) %>%
  as_tsibble(index = date, key = key) %>%
  fill_gaps()

df %>%
  model(naive = ARIMA(value ~ -1 + pdq(0,1,0) + PDQ(0,0,0))) %>%
  interpolate(df)
Error: Problem with `mutate()` input `interpolated`.
✖ no applicable method for 'interpolate' applied to an object of class "null_mdl"
ℹ Input `interpolated` is `map2(naive, new_data, interpolate, ...)`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: It looks like you're trying to fully specify your ARIMA model but have not said if a constant should be included.
You can include a constant using `ARIMA(y~1)` to the formula or exclude it by adding `ARIMA(y~0)`. 
2: 1 error encountered for naive
[1] Could not find an appropriate ARIMA model.
This is likely because automatic selection does not select models with characteristic roots that may be numerically unstable.
For more details, refer to https://otexts.com/fpp3/arima-r.html#plotting-the-characteristic-roots

The TSLM approach with a trend() special doesn't give an exact linear interpolation:

df %>%
  model(naive = TSLM(value ~ trend())) %>%
  interpolate(df)
# A tsibble: 7 x 3 [1M]
# Key:       key [2]
  key       date value
  <fct>    <mth> <dbl>
1 A     2019 Jan  5   
2 A     2019 Feb  7   
3 A     2019 Mar  3.29      <- this should be 4
4 A     2019 Apr  1   
5 B     2019 Jan 25   
6 B     2019 Feb 26   
7 B     2019 Mar 28   

I'm not confident trend() is the right special, but I'm having trouble grasping what it should be.
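
The 3.29 looks like the fitted value of a single least-squares line through all three observations of key A, rather than a point on the segment joining the two neighbours of the gap. The base R sketch below reproduces both numbers. It assumes trend() behaves like an integer month index (1, 2, 3, 4), and lm() and approx() stand in for fable here.

```r
# Key A from the example: months indexed 1..4, March (t = 3) is missing.
t_obs <- c(1, 2, 4)
value <- c(5, 7, 1)

# Global linear trend fitted by least squares to all three points.
fit <- lm(value ~ t_obs)
trend_at_march <- unname(predict(fit, newdata = data.frame(t_obs = 3)))

# Piecewise linear interpolation between the two neighbouring observations.
linear_at_march <- approx(t_obs, value, xout = 3)$y

round(trend_at_march, 2)  # 3.29
linear_at_march           # 4
```

The fitted line is value = 8 - (11/7) t, which gives 23/7 at t = 3. The value 4 comes only from the segment between February and April, so a regression on a global trend would not be expected to return it.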