Implementing SVR and MLR

lv-bakker commented 4 years ago

Dear nredell,

Your package has been of great help for my master's thesis so far. As part of my research I need to compare a number of different ML algorithms with respect to forecasting on grouped time series. The algorithms I wanted to compare were XGBoost, MLR, and SVR (which were all mentioned in previous studies regarding this subject). I followed your "grouped_forecast" guide and was able to achieve proper results with the XGBoost model you implemented. However, when trying to implement both SVR and MLR I get the following error:

Error in model.frame.default(formula = log(outcome_col) ~ ., data = data_train_partition,  : 
  variable lengths differ (found for 'houseid')

I tried the code below with one of the following model functions, for either MLR or SVR.

indices <- 1:nrow(data)
train_indices <- sample(1:nrow(data), ceiling(nrow(data) * .8), replace = FALSE)
test_indices <- indices[!(indices %in% train_indices)]

data_train_partition <- data[train_indices, -(outcome_col)]

data_test_partition <- data[test_indices, -(outcome_col)]

model_function <- function(data, outcome_col = 1) {

  model<- lm(log(outcome_col)~., data=data_train_partition)
  return(model)
}

model_function <- function(data, outcome_col = 1) {

  type = "eps-regression"
  u = -2
  gam=10^{u}; w=4.5
  cost=10^{w}

  model <- svm(outcome_col ~ ., data = data_train_partition, type=type,kernal="rbf", gamma=gam,cost=cost)

  return(model)
}

My dataset looks like this with 44 different houses in total.

    Date    Consumption houseid Elabel  Hsize   Temp
2019-01-01T00:00:00Z    0   1   1   211 8.5
2019-01-01T01:00:00Z    0   1   1   211 8.6
2019-01-01T02:00:00Z    0   1   1   211 8.5
2019-01-01T03:00:00Z    0   1   1   211 8.2
2019-01-01T04:00:00Z    0   1   1   211 8.7
2019-01-01T05:00:00Z    0   1   1   211 8.9

I'm not really well-versed in this particular area so I'm probably doing something really simple wrong, but I can't seem to figure it out.

Your help would be greatly appreciated.

Thanks in advance

Leon Bakker

nredell commented 4 years ago

Thanks for the kind words. One note: If this is for your thesis and you're using grouped time series, I plan to put out 2 short vignettes on forecast combinations and error metrics which will be slightly changing when v0.9.0 gets released here shortly. Your error metrics probably won't change qualitatively between models, but I'll be matching the ongoing M5 forecasting competition in terms of assessing accuracy--calculating error metrics for each individual time series first and then aggregating these for overall accuracy--so this is just a heads up if you're in the middle of this.

I think I see the problem. outcome_col is an integer so log(outcome_col) is literally log(1) here. Go ahead and simply hard-code the outcome as "my_outcome_name" (or whatever it is) on the left hand side of the ~. Should work.
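A minimal sketch of that fix, assuming the outcome column is named Consumption as in the data shown above. The toy data frame and its values are made up purely for illustration:

```r
# Toy data standing in for the real dataset (values are invented).
toy <- data.frame(
  Consumption = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
  Temp        = c(8.5, 8.6, 8.5, 8.2, 8.7, 8.9)
)

# outcome_col is just the integer 1, so log(outcome_col) is a single number,
# not the outcome column.
outcome_col <- 1
log(outcome_col)

model_function <- function(data) {
  # Name the outcome directly on the left-hand side of the formula and fit on
  # the `data` argument that is passed in.
  lm(log(Consumption) ~ ., data = data)
}

model <- model_function(toy)
coef(model)
```

Here the formula refers to a column inside `data`, so the outcome and the predictors always have the same number of rows.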

lv-bakker commented 4 years ago

Thanks for the update on the vignettes. Even with the suggested changes it still doesn't seem to work. The whole error looks like this. I also get the same error when trying to implement SVR as shown above (which doesn't include log).

Error in model.frame.default(formula = log(data$Consumption) ~ ., data = data_train_partition,  : 
  variable lengths differ (found for 'houseid')
In addition: Warning message:
In FUN(X[[i]], ...) :
  A model returned class 'try-error' for validation window 1

Would you have another idea of how to fix this? I really don't know what to try anymore. I've provided my R-scripts below with the source Excel file if that is of any help.

Thanks for your assistance.

nredell commented 4 years ago

I'll give it a go later today.

lv-bakker commented 4 years ago

Thanks a lot!

nredell commented 4 years ago

I noticed a couple things. First, a linear regression with lm() won't work well in the grouped time series case because it cannot use rows with missing values, and there is a lot of missing data after creating lagged features. You could, however, run lm(..., na.action = na.exclude) to simply exclude those cases if you're okay with the amount of lost data for the particular problem.
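To illustrate the row loss, here is a small sketch on invented data. The argument lm() documents for missing-value handling is na.action, and the column name Consumption_lag_1 is hypothetical, chosen only to mimic a lagged feature:

```r
# Toy data with one lagged feature; the first row has no lag available.
toy <- data.frame(
  Consumption       = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
  Consumption_lag_1 = c(NA, 1.2, 2.5, 3.1, 4.8, 5.0)
)

# Rows containing NA are left out of the fit.
model <- lm(Consumption ~ ., data = toy, na.action = na.exclude)

nrow(toy)              # rows supplied
nobs(model)            # rows actually used in the fit
length(fitted(model))  # na.exclude pads fitted values back to the input length
```

Comparing nrow() with nobs() on the real lagged data gives a quick read on how much is being dropped.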

Second, when creating a train and test partition--which is really only needed if tuning hyper-parameters--either do it (a) before create_lagged_df() or (b) inside of the model training function...not between create_lagged_df() and the model training function. In your case, I'd drop the train/test split until everything is working.
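As a rough sketch of option (b), assuming an 80/20 random split like the one in the original post and an outcome column named Consumption. The toy data and the variable names inside the function are illustrative only:

```r
# Toy data standing in for one training dataset (values are invented).
toy <- data.frame(
  Consumption = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.1, 7.9, 8.4, 9.0),
  Temp        = c(8.5, 8.6, 8.5, 8.2, 8.7, 8.9, 9.1, 9.0, 8.8, 8.4)
)

model_function <- function(data) {
  # Split inside the training function so the row indices always refer to the
  # dataset passed in, not to a data frame created elsewhere.
  train_indices <- sample(seq_len(nrow(data)), ceiling(nrow(data) * 0.8))
  data_train <- data[train_indices, , drop = FALSE]
  data_test  <- data[-train_indices, , drop = FALSE]  # held out for tuning

  # Hyper-parameter tuning against data_test would go here.
  lm(Consumption ~ ., data = data_train)
}

model <- model_function(toy)
nobs(model)  # number of rows used for training
```

Because both partitions are derived from `data` inside the function, the formula never mixes columns of different lengths.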

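For the svm() model, the following block of code runs, but it's taking a while on my machine:

library(e1071)  # provides svm()

model_function <- function(data) {

  # Epsilon-regression SVM with a radial (RBF) kernel.
  type <- "eps-regression"
  u <- -2
  w <- 4.5
  gam <- 10^u   # gamma = 0.01
  cost <- 10^w  # cost = 10^4.5

  # e1071 names the argument `kernel` and the RBF kernel "radial".
  model <- svm(Consumption ~ ., data = data, type = type, kernel = "radial", gamma = gam, cost = cost)

  return(model)
}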

Finally, having ~200 validation windows from create_windows() means training 200+ models for each forecast horizon (~600 models in your case). I would probably start with 0, 1, or a few select validation windows to get a sense of how well the model is predicting on unseen data.

lv-bakker commented 4 years ago

Thanks a lot for the help. I will have to have a look at using MLR for this specific goal.

nredell / forecastML

Implementing SVR and MLR #26