Problem with M4 data - Githubissues

thiyangt / seer

:chart_with_upwards_trend: Feature-based Forecast Model Selection (FFORMS) :clock1: :clock130: :clock10: :clock1030: :clock11: :clock1130: :clock12: :clock1230: :clock2: :clock230: :clock3: :clock330: :clock4: :clock430: :clock5: :clock530: :clock6: :clock630: :clock7: :clock730: :clock8: :clock830: :clock9: :clock930:

https://thiyangt.github.io/seer/

79 stars 11 forks source link

Problem with M4 data #3

Closed VicenteYago closed 5 years ago

VicenteYago commented 5 years ago

Hello, im testing the readme example with M4 data, exactly with hourly time series.

Here is my code:

library(seer)
library(Mcomp)
library(forecast)
data(M4)

# Get the first 3 time series with hourly frequency
M4_hourly <- Filter(function(l) l$period == "HOURLY", M4)[1:3]

# Convert data into msts object
hourlym4_msts <- lapply(M4_hourly, function(temp){
    temp$x <- convert_msts(temp$x, "hourly")
    return(temp)
})

#Extract features
hourlym4_msts.feats<-seer::cal_features(hourlym4_msts,
                                        seasonal=TRUE, m=24,lagmax=25L,
                                        database="M4", highfreq = TRUE)

At this point everything is ok, but when I run the following code:

tslist <- list(M4_hourly[[1]], M4_hourly[[2]])
accuracy_info <- fcast_accuracy(tslist=tslist,
                                models= c("arima","ets","rw","rwd", "theta", "nn"),
                                database ="M4", cal_MASE, h=6,
                                length_out = 1, fcast_save = TRUE)

I get this error:

errorSeer

the english traducción is : Error in value[3L] : object 'forecastARIMA' not found

I think the error is that the M4 series do not have a test set as you can see :

M4_hourly$H1$xx

NULL

More precisely i can traceback the error to the following functions:

in the file fcast_accuracy.R :

fcast_accuracy <- function(tslist, models = c("ets", "arima", "rw", "rwd", "wn",
                                         "theta", "stlar", "nn", "snaive", "mstlarima","mstlets", "tbats"), database
                           , accuracyFun, h, length_out, fcast_save){

  arima_models <- NA
  ets_models <- NA

  if (database == "other") {
   train_test <- lapply(tslist, function(temp){list(training=head_ts(temp,h), test=tail_ts(temp, h))})
  } else {
   train_test <- lapply(tslist, function(temp){list(training=temp$x, test=temp$xx)})
  }

  if ("arima"%in% models) {
    arima_cal <- lapply(train_test, accuracy_arima, function_name=accuracyFun, length_out=length_out)
    arima_models <- sapply(arima_cal, function(temp){temp$ARIMAmodel})
  }

........

In the initialization of the train_test inside the else block.

The problem is that initially I considered that the M4 is part of the Mcomp package, due to the fact that there is no warning about it in the readme example. It is not so strange to think that right? 😆

Finally I fix the error considering M4 as "other" database, after make some changes in the list of timeSeries :

tslist <- list(M4_hourly[[1]]$x, M4_hourly[[2]]$x)
accuracy_info <- fcast_accuracy(tslist=tslist,
                                models= c("arima","ets","rw","rwd", "theta", "nn"),
                                database ="other", cal_MASE, h=6,
                                length_out = 1, fcast_save = TRUE)

I hope it help another users of this excellent tool.

Thanks!.

-Vicente.

VicenteYago commented 5 years ago

I just realize that I should use library(M4comp2018) instead of library(Mcomp).

thiyangt commented 5 years ago

Hi Vicente,

Yes, you have spotted the issue correctly. I was aware of this problem and currently working on it. To calculate features you can use the code given in the readme file as it is.

The problem arises when calculating forecast accuracy measures for daily and hourly data, because daily and hourly series were converted into msts objects to handle multiple frequencies. I converted these time series into msts objects based on the length of the training part of the series. So, the following code can be used to MASE measures. I need to update the package along these dimensions for hourly and daily data.

library(M4comp2018)
data(M4)

# Get the first 3 time series with hourly frequency
M4_hourly <- Filter(function(l) l$period == "Hourly", M4)[1:3]
## Combine training period and test period into a single vector
hourly_data <- lapply(M4_hourly, function(temp){
  c(temp$x, temp$xx)
})
hourly_marbased_x <- lapply(hourly_data, function(temp){as.vector(temp)})

#Instead of using convert_msts, I used the following code here because I need to convert time series 
#into msts objects based on the length of the training period of the time series. convert_msts consider #the whole length of the time series. I need to work on this part and make it available on the package. 

## Based on the length of the training period convert series into msts objects
hourly_marbased_convertmsts <- lapply(hourly_marbased_x, function(temp){
  y <- head_ts(temp, 48)
  if (length(y) > 2*8766){
    forecast::msts(temp, seasonal.periods=c(24, 168, 8766))
  } else if( 2*168 < length(y) & length(y) <= 2*8766) {
    forecast::msts(temp, seasonal.periods=c(24, 168))
  } else {
    ts(temp, frequency=24)
  }
})

## Calculate forecast accuracy measures
MASE_m4 <- fcast_accuracy(hourly_marbased_convertmsts, 
                        models=c("rw", "rwd", "wn",
                                 "theta", "stlar", "nn", "snaive", "mstlets","mstlarima", "tbats"),
                        database="other",
                        h=48, accuracyFun=cal_MASE, length_out=1, fcast_save=TRUE)

Further, for daily and hourly series I didn't use arima and ets as a forecast-models, instead I used mstlarima and mstlets. mstlarima and mstlets handle multiple frequencies present in the data. In mstlarima and mstlets, first multiple seasonal decomposition is applied to time series and then snaive is used to forecast seasonal components, and in mstlarima, arima is used to forecast seasonally adjusted series while in mstlets, ets approach is used forecast seasonally adjusted series. I used h=48, that is the test length specified in the M4-competition for hourly series.

VicenteYago commented 5 years ago

Hi again Thiyanga ,

thanks for the recommendations about mstlarima and mstlets.

I don't understand very well the fact that the code you suggested is different from the other other code that uses convert_msts. From my point of view this code :

# Convert data into msts object
hourlym4_msts <- lapply(M4_hourly, function(temp){
    temp$x <- convert_msts(temp$x, "hourly")
    return(temp)
})

is already using the training set (temp$x) to convert the data to msts().

## Combine training period and test period into a single vector
hourly_data <- lapply(M4_hourly, function(temp){
  c(temp$x, temp$xx)
})

#---------->Is this line really necessary ?
hourly_marbased_x <- lapply(hourly_data, function(temp){as.vector(temp)})
#-----------

hourly_marbased_convertmsts <- lapply(hourly_marbased_x, function(temp){
  y <- head_ts(temp, 48)
  if (length(y) > 2*8766){
    forecast::msts(temp, seasonal.periods=c(24, 168, 8766))
  } else if( 2*168 < length(y) & length(y) <= 2*8766) {
    forecast::msts(temp, seasonal.periods=c(24, 168))
  } else {
    ts(temp, frequency=24)
  }
})

It really looks pretty the same to the convert_msts function:

function (y, category)  # convert_msts(temp$x, 'hourly')
{
    switch(category, daily = {
        if (length(y) <= 2 * 366) {
            y <- ts(y, frequency = 7)
        } else {
            y <- forecast::msts(y, seasonal.periods = c(7, 365.25))
        }
    }, hourly = {
        if (length(y) > 2 * 8766) {
            y <- forecast::msts(y, seasonal.periods = c(24, 168, 
                8766))
        } else if (2 * 168 < length(y) & length(y) <= 2 * 8766) {
            y <- forecast::msts(y, seasonal.periods = c(24, 168))
        } else {
            y <- ts(y, frequency = 24)
        }
    })
    return(y)
}
<bytecode: 0x13fb56af8>
<environment: namespace:seer>

so I don't see the difference.

But I know your are right because if I run this commands :

frequency(hourlym4_msts)  # 1  ---> Yearly  frecuency, I bet its wrong !
frequency(hourly_marbased_convertmsts) # returns max(24, 168) --->168 OK!?

the output doesn't match, so they are doing different things that I can't understand

and I have observed the fact that the results for the function fcast_accuracyare not the same. (the MASE measurements are now working good I guess)

And finally I intuit that the core idea that you want to explain me is related to the MASE error dependence at the multiple frequencies of a time series wich that are not working properly in the code that uses convert_msts and thats the reason of your suggested code.

Thanks for your patience ;)

thiyangt commented 5 years ago

Hi Vicente,

You are correct. Both functions convert the series into a mstsobjects based on the length of the training period. However, if you observe the output of hourlym4_msts the frequencies of the training period and test period are different. The hourlym4_msts is computed based on the following code

hourlym4_msts <- lapply(M4_hourly, function(temp){
    temp$x <- convert_msts(temp$x, "hourly")
    return(temp)
})

From the above code, only the training period is converted into a msts object. The test period remains as it is with the original frequency. So the calculation of forecast error measure in this circumstance is problematic.

In the code, I used to get the hourly_marbased_convertmsts first training and test periods are combined to produce a single time series and then the series is converted into msts object based on the length of the training period. Now the full series is converted to msts object not just the training period. Once you pass hourly_marbased_convertmst to the fcast_accuracy function the series is split into training period and test period, and do the remaining calculations (forecast-model is fitted to the training period and forecast accuracy is calculated over the test period). Now training period and test periods have the same frequencies and same object type (msts). Now MASE calculations are working properly.

Hope this helps!!

VicenteYago commented 5 years ago

Thanks a lot Thiyanga ! Finally I understood, of course it had helped me 👍

I have one question more about your great package.

Im working with multiple physical time series (for my undergraduate thesis project, in which I will especially thank you), most of them have a daily and yearly seasonality, should I have additional care in other aspects apart from adapt this code to these frequencies ?( seasonal.periods=c(24, 8766) ):

hourly_marbased_convertmsts <- lapply(hourly_marbased_x, function(temp){
  y <- head_ts(temp, 48)
  if (length(y) > 2*8766){
    forecast::msts(temp, seasonal.periods=c(24, 168, 8766))
  } else if( 2*168 < length(y) & length(y) <= 2*8766) {
    forecast::msts(temp, seasonal.periods=c(24, 168))
  } else {
    ts(temp, frequency=24)
  }
})

I mean details that are not obvious about the implementation of FFORMS that that could be conflicted with these frequencies. (really my data is sampled every 30 minutes but I don't care make it hourly)

thiyangt commented 5 years ago

Hi Vicente,

I think it a good idea to consider other frequencies as well if the series are long enough. For half hourly series daily seasonality (48), weekly seasonality (336) and yearly seasonality (17532).

VicenteYago commented 5 years ago

I Thiyanga, im experiencing some problems with the code you provided to calculate the MASE measures with the proposed methods mstlets , mstlarima and so.

For example:

library(seer)
library(M4comp2018) # M4
library(forecast) 
data(M4)

# Get the first 3 time series with hourly frequency
M4_hourly <- Filter(function(l) l$period == "Hourly", M4)[1:3]

## Combine training period and test period into a single vector
hourly_data <- lapply(M4_hourly, function(temp){
  c(temp$x, temp$xx)
})
hourly_marbased_x <- lapply(hourly_data, function(temp){as.vector(temp)})

## Based on the length of the training period convert series into msts objects
hourly_marbased_convertmsts <- lapply(hourly_marbased_x, function(temp){
  y <- head_ts(temp, 48)
  if (length(y) > 2*8766){ # si es anual
    forecast::msts(temp, seasonal.periods=c(24, 168, 8766))
  } else if (2*168 < length(y) & length(y) <= 2*8766) {
    forecast::msts(temp, seasonal.periods=c(24, 168))
  } else {
    ts(temp, frequency=24)
  }
})

## Calculate forecast accuracy measures
MASE_m4 <- fcast_accuracy(hourly_marbased_convertmsts, 
                        models=c("rw", "rwd", "wn",
                                 "theta", "stlar",
                                 "nn", "snaive",
                                 "mstlets","mstlarima",
                                 "tbats"),
                        database="other",
                        h=48, accuracyFun=cal_MASE, length_out=1, fcast_save=TRUE)

M4_hourly.features <- cal_features(M4_hourly, seasonal=TRUE,
                                   m=24, lagmax=25L, database="M4", highfreq = TRUE)

prep_tset <- prepare_trainingset(accuracy_set = MASE_m4,
                                 feature_set  = M4_hourly.features)

If I run this code I get an empty prep_tset.

I suspect the error is related with the output of MASE_m4:

$accuracy
           rw      rwd       wn     theta     stlar        nn    snaive   mstlets mstlarima
[1,] 2.451312 2.442885 2.432292 0.6809820 0.5392072 0.4728151 0.4978743 0.4489407 0.4627380
[2,] 1.639963 1.639459 1.405621 1.1011778 0.5648872 0.6288910 0.3915063 0.5321529 0.7292256
[3,] 1.966203 1.965833 1.859326 0.5476935 0.7706945 0.6291739 0.8954461 0.7362401 0.8294578
         tbats
[1,] 0.8636725
[2,] 1.1069969
[3,] 0.4593931

$ARIMA
[1] NA   <------- strange

$ETS
[1] NA  <------- strange

$forecasts
$forecasts$rw
      [,1] [,2] [,3]
 [1,]  684 3105 1518
 [2,]  684 3105 1518
 [3,]  684 3105 1518
 [4,]  684 3105 1518
 [5,]  684 3105 1518
.......

Any help about it ?

thiyangt commented 5 years ago

Hi Vicente,

In the output now you can't get the names of actual fitted ARIMA model and ETS model, because we haven't used arima and ets models for daily and hourly series. The output is not strange.

VicenteYago commented 5 years ago

But my issue is related with the empty prep_set I obtain when I run all the code, specifically this:

prepare_trainingset(accuracy_set = MASE_m4,
                                 feature_set  = M4_hourly.features)

(in my last message, first chunk of code)

I think its not normal and maybe a consecuence of working with timeSeries with high frecuency or by the fact that when arima/ets are not specified as methods prepare_trainingset don't do the job properly.

Concretely this is the code im referring:

library(seer)
library(M4comp2018) # M4
library(forecast) 
data(M4)

# Get the first 3 time series with hourly frequency
M4_hourly <- Filter(function(l) l$period == "Hourly", M4)[1:3]

## Combine training period and test period into a single vector
hourly_data <- lapply(M4_hourly, function(temp){
  c(temp$x, temp$xx)
})
hourly_marbased_x <- lapply(hourly_data, function(temp){as.vector(temp)})

## Based on the length of the training period convert series into msts objects
hourly_marbased_convertmsts <- lapply(hourly_marbased_x, function(temp){
  y <- head_ts(temp, 48)
  if (length(y) > 2*8766){ # si es anual
    forecast::msts(temp, seasonal.periods=c(24, 168, 8766))
  } else if (2*168 < length(y) & length(y) <= 2*8766) {
    forecast::msts(temp, seasonal.periods=c(24, 168))
  } else {
    ts(temp, frequency=24)
  }
})

## Calculate forecast accuracy measures
MASE_m4 <- fcast_accuracy(hourly_marbased_convertmsts, 
                        models=c("rw", "rwd", "wn",
                                 "theta", "stlar",
                                 "nn", "snaive",
                                 "mstlets","mstlarima",
                                 "tbats"),
                        database="other",
                        h=48, accuracyFun=cal_MASE, length_out=1, fcast_save=TRUE)

M4_hourly.features <- cal_features(M4_hourly, seasonal=TRUE,
                                   m=24, lagmax=25L, database="M4", highfreq = TRUE)

M4_hourly.features

prep_tset <- prepare_trainingset(accuracy_set = MASE_m4,
                                 feature_set  = M4_hourly.features)

head(prep_tset$trainingset)
head(prep_tset$modelinfo)

Can you confirm if this code produces an empty prep_tset ?

Thanks in advance :)

thiyangt commented 5 years ago

Use MASE_m4$accuracy, as the first argument, as follows.

prep_tset <- prepare_trainingset(accuracy_set = MASE_m4$accuracy,
                                 feature_set  = M4_hourly.features)

VicenteYago commented 5 years ago

i obtain this error:

prep_tset <- prepare_trainingset(accuracy_set = MASE_m4$accuracy,
                                 feature_set  = M4_hourly.features)

#Error: $ operator is invalid for atomic vectors

And makes sense because prepare_trainingset receive a list and extracts accuracy item from it:

function (accuracy_set, feature_set)  #prepare_trainingset
{
    accuracy_measures <- as.data.frame(accuracy_set$accuracy)
 ........
}

thiyangt commented 5 years ago

The easiest way to solve the issue is,

We create dummies for ARIMA and ETS ARIMA <- rep("arima", INCLUDE NUMBER OF TIME SERIES )

According to the example code ARIMA <- rep("arima", 3)

ETS <- rep("ets", INCLUDE NUMBER OF TIME SERIES)

According to the example code ETS <- rep("ets", 3)

Then, acc_list <- list(accuracy=accuracy_set$accuracy, ARIMA=ARIMA, ETS=ETS) prep_tset <- prepare_trainingset(accuracy_set = acc_list , feature_set = M4_hourly.features)

VicenteYago commented 5 years ago

Thanks, now works good.