mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

How to use mlr for time series regression #309

Closed weldebrhan closed 5 years ago

weldebrhan commented 9 years ago

Hi all, I want to use mlr for time series regression, including hyperparameter optimization. R has a number of packages for time series regression, such as the forecast package, and others for multivariate regression problems (http://cran.r-project.org/web/views/TimeSeries.html).

Can we use the mlr package for time series regression?

thanks

Weldebrhan

berndbischl commented 9 years ago

This was also discussed a bit in #301. Please read that too.

So, currently the answer is "no, but maybe an extension is possible". To give a better answer we really need to know what kind of models you are interested in: stuff like ARIMA and so on, or more of a reduction to non-linear machine learning techniques, which would then also work on more than one time series?

I guess the latter was requested in #301.

rpinsler commented 9 years ago

It would also be helpful to have a holdout variant suitable for time series, e.g. a 2/3 split would partition the data into the first 2/3 and the last 1/3, assuming that the data is ordered by time. Maybe you can add an additional parameter "keep.order" or something for the holdout function?
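The ordered split described here can be sketched in base R. The helper name `ordered_holdout` and its `split` argument are hypothetical and not part of mlr; the sketch assumes the rows are already sorted by time.

```r
# Hypothetical helper (not an mlr function): split row indices in time order.
ordered_holdout <- function(n, split = 2 / 3) {
  stopifnot(n >= 2, split > 0, split < 1)
  # Number of training rows, rounded to the nearest whole row.
  n_train <- round(n * split)
  stopifnot(n_train >= 1, n_train < n)
  list(train = seq_len(n_train), test = (n_train + 1L):n)
}

idx <- ordered_holdout(9)
idx$train  # the first 2/3 of the rows
idx$test   # the last 1/3 of the rows
```

Unlike a random holdout, no shuffling takes place, so every test row lies strictly after every training row.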

berndbischl commented 9 years ago

Sure. But I would also really like to have a derived subclass for a task, so a TimeSeriesRegrTask. Only that makes sense.

I currently cannot do this myself, maybe after lectures end here in about 6 weeks. But I am happy to support anyone who wants to help and to try doing this.

rpinsler commented 9 years ago

Well, in our case we actually formulate it as a time series classification problem. So, maybe it is easier to simply enhance the holdout/resampling function.

berndbischl commented 9 years ago

It's not, IMHO, a question of what "somehow works" but of getting the right structure. Of course we would also add TimeSeriesClassifTask. If we have the one, the other is trivial.

berndbischl commented 8 years ago

@florianfendt will later work on this in a larger, already planned project, so we don't need this issue now

SteveBronder commented 8 years ago

@berndbischl I have about two weeks between work and school to work on this. Is there a place on here where I can see what you have done so far wrt time series methods?

berndbischl commented 8 years ago

@berndbischl I have about two weeks between work and school to work on this. Is there a place on here where I can see what you have done so far wrt time series methods?

not much. we are currently at the mlr workshop. anything you have in mind? would you like working on this a bit? could you outline your ideas?

SteveBronder commented 8 years ago

@berndbischl

We can break time series problems into

  1. Implementing Econometric metrics such as MASE and RelMedAE
  2. Model evaluation w.r.t. time
    • Ex. Windowing like in createTimeSlices()
  3. Allowing / creating lagged variables for use in arbitrary models
    • Ex. Simple vector auto-regression schemes, making lagged variables in the data generating process
  4. Import time series models such as Arima, Holt-winters, etc.
    • Ideally importing a large amount of the forecast package

I started looking at your code base today, but I believe these are in the order of difficulty.
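For item 1 in the list above, a metric such as MASE is short to write down. The following is a minimal sketch of the usual non-seasonal definition (mean absolute forecast error divided by the in-sample mean absolute error of the naive previous-value forecast). The name `mase_sketch` and its arguments are hypothetical; this is not the measure implemented in mlr or in the fork.

```r
# Hypothetical sketch of MASE (mean absolute scaled error), non-seasonal form.
# truth, response: held-out values and their forecasts; train: in-sample series.
mase_sketch <- function(truth, response, train) {
  stopifnot(length(truth) == length(response), length(train) >= 2)
  # In-sample mean absolute error of the naive "previous value" forecast.
  naive_mae <- mean(abs(diff(train)))
  stopifnot(naive_mae > 0)
  mean(abs(truth - response)) / naive_mae
}

train <- c(1, 2, 4, 7)
score <- mase_sketch(truth = c(8, 10), response = c(9, 8), train = train)
score
```

Values below 1 mean the forecast beats the naive one-step forecast on average. The training series is needed as an extra input, which ordinary regression measures do not require.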

I would not think of time series methods as a completely different set of methods, but instead as a complement to existing methods. For example, in makeRegrTask() we can have a parameter, index, that allows the user to specify that their data has a time element. From there, preprocessing methods can have a parameter lags that lets the user specify the maximum number of lags allowed in their data set, which variables to lag, the earliest lag (i.e. removing today's values if they will not be known at prediction time), etc.
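The lag preprocessing described here could look roughly like the base R sketch below. The helper `add_lags` and the arguments `vars`, `max_lag` and `min_lag` are hypothetical names chosen for illustration; `min_lag` plays the role of the "earliest lag". Rows are assumed to be sorted by time.

```r
# Hypothetical helper: add columns <var>_lag<k> for k = min_lag, ..., max_lag.
add_lags <- function(data, vars, max_lag, min_lag = 1L) {
  stopifnot(is.data.frame(data), all(vars %in% names(data)),
            min_lag >= 0, max_lag >= min_lag, max_lag < nrow(data))
  n <- nrow(data)
  for (v in vars) {
    for (k in min_lag:max_lag) {
      # Shift the column down by k rows; the first k rows have no history.
      data[[paste0(v, "_lag", k)]] <- c(rep(NA, k), data[[v]][seq_len(n - k)])
    }
  }
  data
}

df <- data.frame(y = c(10, 20, 30, 40, 50))
lagged_df <- add_lags(df, vars = "y", max_lag = 2)
lagged_df
```

Setting `min_lag = 2`, for example, would leave out the one-step lag for variables whose most recent value is not yet known at prediction time.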

berndbischl commented 8 years ago

. For example, in makeRegrTask() we can have a parameter, index, that allows the user to specify that their data has a time element.

yes. this seems like a good idea. but still, i would create a new task for this. if you really think this is an extension of a RegrTask, do the usual thing in OO: derive from the class. there is no downside, but still a good separation?

what is your personal plan regarding this? do you want to help? also @florianfendt will work on this soon.

SteveBronder commented 8 years ago

@berndbischl If @florianfendt has a plan for this I am all ears.

do the usual thing in OO

This is the best option.

This week: Made a fork, going to look through caret and forecast and see how they do these methods. Then going to go through mlr's base and map where the differences and difficulties may arise.

The metrics should not be difficult, but I'm not sure if the chicken or the egg or the other egg comes first w.r.t. models, lags, and windowing. @florianfendt is what I'm talking about similar to your plan?

SteveBronder commented 8 years ago

@berndbischl

I have some very good news!

I am going to try to implement time series methods in MLR as my Master's thesis.

I would like some opinions on how to deal with the dates.

It would be best for makeTimeRegrTask() to use xts objects as the data parameter because xts is a consistent and widely used format for working with time series in R. However, many of the underlying parts of the task functions access the data through things such as data[[target]], which throws an error for xts objects since they are based on matrices.

My compromise for this is to first use assertClass(data,"xts") so the user has to bring in an xts object, but then convert it to a data.frame before going through everything else. There are two ways to do this.

a. Make the dates their own variable data <- data.frame(dates = index(data), coredata(data))

b. Make the dates into the rownames data <- data.frame(row.names = index(data), coredata(data))

Each of these has its own positives and negatives.

(a) fails checkTaskData() as POSIXt variable types trigger

stopf("Unsupported feature type (%s) in column '%s'.", class(x)[1L], cn)

So I can add an if(is.POSIXt(x)) statement to checkTaskData(), though I'm not sure if this will affect other parts of the code base.

(b) is a handful because the row names do not stay POSIXt type. This means a lot of unnecessary conversion. For instance, to print a time task's From: and To: dates I first have to check whether the rownames can be converted to POSIXt, convert them to POSIXt, and then do the print. This check and conversion would have to apply everywhere.
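The difference between (a) and (b) can be seen without xts. The sketch below uses plain POSIXct time stamps in place of index(data) and a made-up numeric vector in place of coredata(data); those stand-ins are assumptions for illustration only.

```r
# Three time stamps, one day apart (values are illustrative).
times <- as.POSIXct("2016-09-12 16:38:28", tz = "UTC") + 86400 * 0:2
y <- c(0.5, -1.2, 0.3)

# (a) dates as their own column: the POSIXct class is preserved.
df_col <- data.frame(dates = times, arma_test = y)
class(df_col$dates)

# (b) dates as row names: row names are always stored as character strings.
df_row <- data.frame(arma_test = y,
                     row.names = format(times, "%Y-%m-%d %H:%M:%S"))
class(rownames(df_row))

# Recovering the dates in (b) needs an explicit conversion every time.
recovered <- as.POSIXct(rownames(df_row), tz = "UTC")
range(recovered)
```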

Does anyone have an opinion on which is better? I think adding a date column would be the best. I am very open to other suggestions.

PhilippPro commented 8 years ago

Hi, that sounds really cool.

Thinking a bit further:

berndbischl commented 8 years ago

I am going to try to implement time series methods in MLR as my Master's thesis.

this sounds quite nice. it is important though that, given that you want this merged into the master / a future release we talk about the design then.

so the user has to bring in a xts object, but then convert it to a data.frame before going through everything else

that currently might really be the best option so you don't have to change mlr in too many places.

There are two ways to do this.

please really use an extra column for the date. that seems much better. this will then also not affect other parts of the code.

fails checkTaskData() as POSIXt variable types trigger

simply overwrite that through OO / S3. this is no problem.
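A minimal sketch of what overriding the check through S3 could look like is given below. All names (`check_columns`, the `TimeTask` class, the `date.col` field) are hypothetical, and mlr's real checkTaskData() may be structured differently; the point is only that the subclass gets its own method while the default behaviour stays untouched.

```r
# Hypothetical generic; this is not mlr's actual checkTaskData().
check_columns <- function(task, ...) UseMethod("check_columns")

# Default behaviour: reject date-time columns, as described in the thread.
check_columns.default <- function(task, ...) {
  for (cn in names(task$data)) {
    x <- task$data[[cn]]
    if (inherits(x, "POSIXt")) {
      stop(sprintf("Unsupported feature type (%s) in column '%s'.",
                   class(x)[1L], cn))
    }
  }
  invisible(TRUE)
}

# Method for the time task subclass: set the date column aside,
# then apply the unchanged default check to the remaining columns.
check_columns.TimeTask <- function(task, ...) {
  keep <- setdiff(names(task$data), task$date.col)
  check_columns.default(list(data = task$data[, keep, drop = FALSE]))
}

times <- as.POSIXct("2016-09-12", tz = "UTC") + 86400 * 0:2
plain <- list(data = data.frame(dates = times, y = 1:3))
timed <- structure(list(data = plain$data, date.col = "dates"),
                   class = "TimeTask")

plain_ok <- tryCatch(check_columns(plain), error = function(e) FALSE)
timed_ok <- check_columns(timed)
```

Here `plain_ok` is FALSE because the default method rejects the date column, while the `TimeTask` method accepts the same data.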

berndbischl commented 8 years ago

PS: and we really have to talk with @florianfendt about this

SteveBronder commented 8 years ago

@PhilippPro Thanks! I am very excited about this project and think mlr is bar none the best place to merge econometrics and machine learning.

Is xts really that standard?

xts provides a rigorous structure for handling dates and I think the cost of the user learning how to make a date index is negligible. When we get to data pre-processing, xts's built-in lag and difference functions are going to be really nice to have access to.

Maybe have the option for a standardized conversion of the date into several columns

This would be nice for constructing lags in preprocessing, if the user wanted monthly or yearly lags, etc. I have not yet built that part of the preprocessing, but we could let the user select the level of lag and then make the columns.

standard time series models like arima or machine learning methods, that are already in mlr.

The goal is to build new tasks, resampling schemes, preprocessing, and prediction functions that work with the forecast package and the current algorithms implemented in mlr. I'll make new tasks, each being of the form makeTime<task_type>Task() for accounting for time in regression, classification, etc.

@berndbischl

a future release we talk about the design then.

Would you have time to chat this week via skype?

Yes I would like to hear @florianfendt 's opinion on integrating / extending mlr and time series.

SteveBronder commented 8 years ago

On my fork of mlr I have worked out Arima for the forecast package, but I'm not sure how to handle prediction https://github.com/Stevo15025/mlr

Note the following example

library(xts)
library(lubridate)

# make fake data
dat <- arima.sim(model = list(ar = c(.8,.1), ma = c(.4), order = c(2,0,1)), n = 1000)
times <- Sys.time() + days(1:1000)
dat <- xts(dat,order.by = times)
colnames(dat) <- c("arma_test")

# keep the last 5 observations as a non-overlapping test set
dat.train <- dat[1:(nrow(dat) - 5), ]
dat.test <- dat[(nrow(dat) - 4):nrow(dat), ]

# fancy new time regression task
Timeregr.task = makeTimeRegrTask(id = "test", data = dat.train, target = "arma_test")
Timeregr.task
# Supervised task: test
# Type: regr
# Target: arma_test
# Observations: 995
# From: 2016-09-12 16:38:28
# To:   2019-06-03 16:38:28
# Features:
# numerics  factors  ordered 
#        1        0        0 
# Missings: FALSE
# Has weights: FALSE
# Has blocking: FALSE

registerS3method("makeRLearner", "timereg.Arima", makeRLearner.timereg.Arima)
registerS3method("trainLearner", "timereg.Arima", trainLearner.timereg.Arima)
registerS3method("predictLearner", "timereg.Arima", predictLearner.timereg.Arima)

# run learner and predict
arm = makeLearner("timereg.Arima", order = c(2,0,1), n.ahead = 5L)
mod = train(arm, Timeregr.task)
mod.pred = predict(mod, task = Timeregr.task, holdOut = dat.test)
mod.pred
# Prediction: 5 observations
# predict.type: response
# threshold: 
# time: 0.00
#   id truth  response
# 1  1    NA -1.728292
# 2  2    NA -1.562834
# 3  3    NA -1.411100
# 4  4    NA -1.273539
# 5  5    NA -1.148526

The truth column is NA since we are forecasting 5 steps ahead (n.ahead = 5) and we do not know what the target variable's values will be five steps into the future. While this is logically fine, it does put a damper on performance metrics:

performance(mod.pred,medse)
# medse 
#    NA 

You can see above I tried to pass a holdout set that contains the next five predictions, but predict.WrappedModel() seems to ignore the holdOut parameter. Any thoughts on what I should do here? Maybe it would be a good idea to have a new predict.type?
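For comparison, here is a sketch of the same workflow outside mlr, where the held-out values are matched to the forecasts by hand. It uses stats::arima() from base R rather than forecast::Arima(), which is an assumption made to keep the example dependency-free; the variable names are illustrative.

```r
set.seed(42)
# Simulate an ARMA(2,1) series with the same coefficients as the example above.
y <- as.numeric(arima.sim(model = list(ar = c(0.8, 0.1), ma = 0.4), n = 1000))

n <- length(y)
horizon <- 5L
train <- y[seq_len(n - horizon)]
holdout <- y[(n - horizon + 1L):n]

# Fit on the training part only and forecast `horizon` steps ahead.
fit <- arima(train, order = c(2, 0, 1), method = "ML")
response <- as.numeric(predict(fit, n.ahead = horizon)$pred)

# With the held-out truth available, the metric is no longer NA.
medse_value <- median((holdout - response)^2)
data.frame(id = seq_len(horizon), truth = holdout, response = response)
```

A forecasting-aware predict step would need to do this pairing of forecasts with held-out values internally before any performance measure is computed.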

PhilippPro commented 8 years ago

Ok, some comments/questions from me, although Bernd is more important. ;)

berndbischl commented 8 years ago

Would you have time to chat this week via skype?

that's quite hard, i am on a workshop marathon. weekend might be ok, if you don't hate that. can you please write an email for that so we don't clutter up the thread here? me = bernd_bischl@gmx.net

SteveBronder commented 8 years ago

@PhilippPro

although Bernd is more important. ;)

I value and am happy to have your comments and questions!!

I would rather call it timeregr.arima, to be consistent with regr

Good to know, the function name in forecast is Arima() and I did not want people to be confused with base R's arima(), but if it is more consistent with regr I will change that.

n.ahead should not be specified (in the makeLearner step)?

n.ahead has to be specified in the makeLearner() because it is a parameter used at prediction time. When I try to put it in predict() it does not pass through to Arima's predict function.

If you have no newdata (see next point) and only the task in predict, you could just predict the old data (as far as possible)

The predict() function for Arima() actually does not have a parameter for new data. This is because predict.Arima() uses KalmanForecast(n.ahead, object$model) to make the next n.ahead predictions. So as far as possible is actually pretty much forever.

dat.test is giving you the dates, so for all dates of dat.test there should be a prediction?

Yes, as we create a windowing method, our holdout set (in this case dat.test) will be the same length as how far we are trying to forecast.
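The windowing idea can be sketched in base R as a growing training window followed by a test block of `horizon` rows. The helper `growing_window_slices` and its argument names are hypothetical; they only loosely mirror caret's createTimeSlices() and are not mlr's resampling code.

```r
# Hypothetical helper: growing training windows with fixed-length test blocks.
growing_window_slices <- function(n, initial_window, horizon, step = 1L) {
  stopifnot(initial_window >= 1, horizon >= 1, step >= 1,
            initial_window + horizon <= n)
  # Last training row of each slice.
  train_ends <- seq(initial_window, n - horizon, by = step)
  lapply(train_ends, function(end) {
    list(train = seq_len(end), test = (end + 1L):(end + horizon))
  })
}

slices <- growing_window_slices(n = 10, initial_window = 6, horizon = 2)
length(slices)
slices[[1]]$train
slices[[1]]$test
```

Each test block has exactly `horizon` rows, so the holdout length always matches the forecast length.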

I would use the newdata argument for putting dat.test in. And then you can give also truth and response?

Because the target is also our predictor, predict.WrappedModel() throws out the target (predictor) variable when it is in new data. So we would have to put in a new if statement before line 82 of predict

newdata = newdata[, -t.col, drop = FALSE]

The culmination of all of these questions leads me to think that we should have a new forecast.WrappedModel() function that can handle forecasting separately from normal prediction.

The new function would look something like

forecast(object = arma, task = Timeregtask, holdout = dat.test, ...)

So that when we forecast five periods in the future holdout would have those five 'truth' values we left out. I think creating a new function is better because, for the user, it clearly differentiates between predicting today and predicting tomorrow.

What happens if the differences of the dates are not always the same?

This is a very good point and something that could be checked against in a new forecast() function.

Arima is not applicable anymore but other techniques.

Yes, this is correct. I think for now we should handle cases where dates come in a sequential pattern and throw an error when the differences between dates are not all the same. In the future, given certain models, we can allow for the observations to be non-sequential.
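Such a check is short in base R. The function name `assert_regular_spacing` and the tolerance argument are hypothetical; the sketch compares gaps in seconds, so it assumes time stamps in a zone without daylight saving shifts and would treat calendar months as irregular.

```r
# Hypothetical check: stop unless consecutive time stamps are equally spaced.
assert_regular_spacing <- function(times, tol = 1e-8) {
  stopifnot(length(times) >= 2)
  gaps <- diff(as.numeric(times))
  if (any(gaps <= 0)) {
    stop("Time stamps must be strictly increasing.")
  }
  if (any(abs(gaps - gaps[1L]) > tol)) {
    stop("Time stamps are not equally spaced.")
  }
  invisible(TRUE)
}

start <- as.POSIXct("2016-09-12", tz = "UTC")
regular <- start + 86400 * 0:4
irregular <- start + 86400 * c(0, 1, 2, 4, 5)

regular_ok <- assert_regular_spacing(regular)
gap_error <- tryCatch(assert_regular_spacing(irregular),
                      error = function(e) conditionMessage(e))
```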

What happens if you want to include some independent variables?

Arima has a parameter called xreg which accepts a matrix of independent variables to be used within the model. Then you have to pass those newxregs to Arima's predict() function. I left this out for now as I cannot figure out which param functions allow for the user to bring in a matrix as the parameter. Does that exist? Something like makeNumericMatrixParam()?
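To make the xreg mechanics concrete, here is a sketch with stats::arima() from base R, used here as a stand-in for forecast::Arima() (an assumption; the two differ in details). The regressor matrix is passed as `xreg` at fit time and its future values as `newxreg` at prediction time. The data are simulated for illustration.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
# Target: linear effect of x plus AR(1) noise.
y <- 2 * x + as.numeric(arima.sim(model = list(ar = 0.5), n = n))

horizon <- 5L
train_idx <- seq_len(n - horizon)
xreg_train <- matrix(x[train_idx], ncol = 1, dimnames = list(NULL, "x"))
xreg_new <- matrix(x[-train_idx], ncol = 1, dimnames = list(NULL, "x"))

# Regression with AR(1) errors; the regressors enter as a matrix.
fit <- arima(y[train_idx], order = c(1, 0, 0), xreg = xreg_train)

# Forecasting requires the future regressor values for every step ahead.
fc <- predict(fit, n.ahead = horizon, newxreg = xreg_new)
xreg_forecast <- as.numeric(fc$pred)
xreg_coef <- unname(coef(fit)[length(coef(fit))])
```

This shows why a matrix-valued argument is needed at both training and prediction time, which is the part that does not map directly onto scalar or vector learner parameters.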

@berndbischl Weekend would be fine for me, though if you need time for R&R after your workshops we can talk the next week. I will email you shortly.

PhilippPro commented 8 years ago

Ok, then call it timeregr.Arima. ;)

At the moment I think that the Arima function doesn't really fit into the current mlr structure, as you pointed out. Estimating a simple AR process could also be done by just adding lag variables and then choosing an arbitrary model.

The independent variables could be just included in the dataframe when creating the task and in the newdata as extra columns in newdata, as usual, I think.

SteveBronder commented 8 years ago

At the moment I think, that the Arima function doesn't really fit into the current mlr structure,

Yes, though with an addition of a forecast() function I think it will sit nicely in the framework.

Estimating a simple AR process could also be done by just adding lag variables and then choosing an arbitrary model.

Yes! One of my goals includes allowing simple lags and differences inside of the models already in mlr as a preprocessing feature. Then we can use all these models in the context of time series.
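As a small illustration of that point, an AR(1) process can be estimated with an ordinary regression on a lagged copy of the series. The sketch below uses base R only; lm() stands in for "an arbitrary model", and the simulated coefficient 0.6 is an assumption chosen for the example.

```r
set.seed(123)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 2000))

# embed() puts y_t in column 1 and y_{t-1} in column 2.
lagged <- embed(y, 2)
ar_data <- data.frame(y = lagged[, 1], y_lag1 = lagged[, 2])

# Any regression learner could be used here; lm() keeps the sketch simple.
fit <- lm(y ~ y_lag1, data = ar_data)
ar_estimate <- coef(fit)[["y_lag1"]]
ar_estimate  # should be close to the simulated AR coefficient
```

The first observation is lost because it has no lagged value, which is the same cost any lag-based preprocessing step would incur.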

The independent variables could be just included in the dataframe when creating the task and in the newdata as extra columns in newdata, as usual, I think.

Yes, though there still needs to be a holdout set, as the new periods we are predicting in the future will not have realized values for the newdata. (sorry if that wording is confusing)

SteveBronder commented 8 years ago

Practice script for the time series extension of mlr: https://github.com/Stevo15025/mlr

library(xts)
library(lubridate)

dat <- arima.sim(model = list(ar = c(.5,.2), ma = c(.4), order = c(2,0,1)), n = 10000)
times <- Sys.time() + days(1:10000)
dat <- xts(dat,order.by = times)
colnames(dat) <- c("arma_test")

Timeregr.task = makeTimeRegrTask(id = "test", data = dat, target = "arma_test")
Timeregr.task

arm = makeLearner("timeregr.Arima", order = c(2,0,1), n.ahead = 10L, include.mean = FALSE)
arm
resamp_desc = makeResampleDesc("GrowingCV", horizon = 10L, initialWindow = 8000L, task = Timeregr.task, skip = 50L)
resamp_desc
resamp_arm = resample(arm,Timeregr.task, resamp_desc, measures = mase)
resamp_arm

## Or do Tuning
par_set = makeParamSet(
  makeIntegerVectorParam(id = "order",
                         len = 3L,
                         lower = c(0L,0L,0L),
                         upper = c(3L,1L,1L),
                         tunable = TRUE),
  makeIntegerVectorParam(id = "seasonal",
                         len = 3L,
                         lower = c(0L,0L,0L),
                         upper = c(1L,1L,1L),
                         tunable = TRUE),
  makeLogicalParam(id = "include.mean",
                   default = FALSE,
                   tunable = TRUE),
  makeNumericParam(id = "n.ahead",
                   default = 10,
                   tunable = FALSE,
                   lower = 10,
                   upper = 10)
)

#Specify tune by grid estimation
ctrl = makeTuneControlGrid()

#
PhilippPro commented 7 years ago

Maybe this blog post helps you: https://www.r-bloggers.com/better-model-selection-for-evolving-models/

SteveBronder commented 7 years ago

@PhilippPro thanks!

You can see the newest updates to time series in mlr on my fork https://github.com/Stevo15025/mlr

SteveBronder commented 7 years ago

To update this, see PR #1318 which houses the forecasting branch extension

pat-s commented 5 years ago

Cleaning up / closing old issues. I assume that this enhancement is not of interest anymore after all this time. Feel free to re-open if someone is tackling it again.

edgBR commented 4 years ago

Is this still ongoing?

I have noticed that growing CV is implemented now in:

https://mlr.mlr-org.com/reference/makeResampleDesc.html

Is there any proper documentation for this?

pat-s commented 4 years ago

No, there is no proper support for time-series in mlr. We are currently about to support this in https://github.com/mlr-org/mlr3forecasting for the new mlr3 framework.

mlr won't get any new features.

edgBR commented 4 years ago

Hi pat-s,

Thank you for your answer. I have also noticed that tidymodels is using rsample as a backend to implement the train/test splits: https://tidymodels.github.io/rsample/reference/rolling_origin.html

Do you know what is the official release date for the support of this feature in mlr3?

berndbischl commented 4 years ago

Do you know what is the official release date for the support of this feature in mlr3?

maybe in feb 2020 might somewhat be realistic? ping us in mlr3 if you want some updates there. also right now might be the correct time - to bridge the waiting :) - to have a look at mlr3 and the book to learn the new API and features and so on. if you want to comment on the api for forecasting we are also very happy to hear ideas