mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Working with scalable ML frameworks #277

Closed kevinykuo closed 7 years ago

kevinykuo commented 9 years ago

There has been a trend toward scalable and distributed frameworks for machine learning, and I think it may be worth exploring whether we can/should extend the mlr infrastructure to accommodate that.

As an example, I've been shifting my ML workflow to H2O, and currently it's not viable to write mlr wrappers for the h2o.* functions. To start, creating a task fails due to the assertDataFrame call (a H2O data frame lives in a Java K/V store layer and there's an R object pointing to it). One way to get around this could be to work with R data frames and pass the data to H2O each time we train, but that's not viable due to all the I/O.
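To illustrate the failure mode: a minimal sketch (note that `h2o.init()`/`as.h2o()` signatures have changed across h2o releases, so the exact calls may differ):

```r
library(mlr)
library(h2o)

h2o.init()                   # start/attach a local H2O cluster
iris_h2o <- as.h2o(iris)     # the R object is only a pointer into the cluster

# mlr validates its input as a data.frame, so passing the H2OFrame
# pointer fails the assertion instead of creating a task:
task <- makeClassifTask(data = iris_h2o, target = "Species")
# errors: assertion on 'data' fails because it is not a data.frame
```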

At first glance, it should be possible to relax the type of "data" we associate with tasks and perhaps generalize subsetting and sampling. I'm new to mlr and have only begun thinking about this, so I'm interested in what others' thoughts are.

One could argue that such a framework really belongs to the underlying libraries, but we'll likely see more of these in the future (perhaps with SparkR and MLlib, as an example), and it would benefit the community if we had one unifying framework to do ML. Looking at the competition, the Python community seems to have work underway to integrate scikit-learn and PySpark, and I think this is something we can at least think about.

larskotthoff commented 9 years ago

At the mlr layer it should certainly be possible to change/extend what data can be handled, but this would most likely involve quite a bit of plumbing to make it work with the underlying learners (which expect data frames). This would probably require converting to R data frames at least in some cases.

An alternative would be to add a new class of tasks and learners for this kind of thing. I'm not really familiar with H2O; how do they handle learners there? Presumably they are specific H2O learners and not generic ones?

kevinykuo commented 9 years ago

Sorry for the late reply -- been a busy week.

We can think of h2o as another package which contains specific learners, e.g. h2o.gbm() is like gbm() except it takes as its data input an H2OParsedData object, which is a pointer in the R environment to a dataset living in the h2o cluster:

library(h2o)
library(magrittr)
localH2O <- h2o.init()
iris_h2o <- as.h2o(localH2O, iris, "iris_h2o")
typeof(iris_h2o)
# [1] "S4"
iris_h2o[1:2,] # subsetting (e.g. for CV) works
# IP Address: 127.0.0.1 
# Port      : 54321 
# Parsed Data Key: Last.value.4 
# 
# Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# Petal.Width Species
# 1         0.2  setosa
# 2         0.2  setosa

y <- "Species"
x <- names(iris_h2o) %>% setdiff(y)
model1 <- h2o.gbm(x = x, y = y, data = iris_h2o, distribution = "multinomial")
typeof(model1)
# [1] "S4"

h2o.predict(model1, iris_h2o) %>% as.data.frame %>% head
# predict    setosa versicolor virginica
# 1  setosa 0.7995457  0.1002270 0.1002272
# 2  setosa 0.7995457  0.1002270 0.1002272
# 3  setosa 0.7995453  0.1002272 0.1002275
# 4  setosa 0.7995453  0.1002272 0.1002275
# 5  setosa 0.7995457  0.1002270 0.1002272
# 6  setosa 0.7995457  0.1002270 0.1002272

As we can see, the heavy lifting (training, prediction, etc.) happens on the server side, but there are ways to extract the relevant results back into R. If we can generalize the data input a bit for mlr, I think I can play around with writing the learners, but I'm not sure which layer would be the best place to do that, e.g. do we take a character vector naming the data object and handle the rest on the learner side, or do we allow for more data types?
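Such a learner could be sketched with mlr's standard extension mechanism (makeRLearnerClassif / trainLearner / predictLearner). This is a hypothetical sketch, assuming tasks were relaxed so that `getTaskData()` could return an H2OFrame, and using the old `h2o.gbm(data = ...)` signature shown above:

```r
# Hypothetical: register h2o.gbm as an mlr classification learner.
makeRLearner.classif.h2o.gbm <- function() {
  makeRLearnerClassif(
    cl = "classif.h2o.gbm",
    package = "h2o",
    par.set = makeParamSet(
      makeIntegerLearnerParam("n.trees", lower = 1L, default = 50L)
    ),
    properties = c("twoclass", "multiclass", "numerics", "factors", "prob")
  )
}

trainLearner.classif.h2o.gbm <- function(.learner, .task, .subset, ...) {
  data <- getTaskData(.task, .subset)  # would need to yield an H2OFrame
  h2o.gbm(x = getTaskFeatureNames(.task), y = getTaskTargetNames(.task),
          data = data, distribution = "multinomial", ...)
}

predictLearner.classif.h2o.gbm <- function(.learner, .model, .newdata, ...) {
  # pull only the predicted labels back into R
  as.data.frame(h2o.predict(.model$learner.model, .newdata))$predict
}
```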

larskotthoff commented 9 years ago

Hmm, would it make sense to handle this in a conversion layer for H2O to avoid duplicating tasks/learners/etc? Are there any reasons why you would want more fine-grained control over what's passed to H2O (e.g. need to set custom options)?

kevinykuo commented 9 years ago

@larskotthoff Every function parameter I want to pass to H2O, I can pass through the learner I create. What I'm having trouble with is how to access the data from the learner. One thing we could do is capture the data promise in the task; then inside the learner I can do something in the spirit of get(paste0(substitute(data), "_h2o")), assuming we enforce a naming scheme. This way both R and H2O learners could share the same task, which would (probably) require minimal code changes. However, if the data is too big to fit in memory on one node, i.e. we have mydata_h2o but no mydata, then we'd have to relax the data frame assumption.
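The naming-scheme idea could be sketched like this (hypothetical helper; note that `substitute()` returns a symbol, so it needs a `deparse()` before `paste0()`):

```r
# Hypothetical lookup: for an R data.frame `mydata`, fetch the companion
# H2OFrame registered under the enforced name `mydata_h2o`.
getH2OTwin <- function(data) {
  get(paste0(deparse(substitute(data)), "_h2o"), envir = globalenv())
}

mydata <- iris
mydata_h2o <- as.h2o(iris)    # assumes an initialized h2o cluster
frame <- getH2OTwin(mydata)   # resolves to the cluster-side object
```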

I plan to hack something together to illustrate this issue and learn more about mlr at the same time. In the meantime, I appreciate any ideas thrown my way.

larskotthoff commented 9 years ago

Ok, so basically a conversion function from mlr data to H2O data and back?
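In the simplest case that conversion layer could look like the following (hypothetical helper names; this is only viable while the data still fits in R memory, which is exactly the limitation discussed above):

```r
# Hypothetical conversion helpers between mlr tasks and H2O frames.
toH2O <- function(task) {
  as.h2o(getTaskData(task))   # push the task's data.frame to the cluster
}

fromH2O <- function(h2o_frame) {
  as.data.frame(h2o_frame)    # pull a frame/predictions back into R
}
```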

larskotthoff commented 7 years ago

No activity in almost two years, closing.

berndbischl commented 7 years ago

Actually, h2o has been included in mlr for quite a while now, so that part has been completed.

And Spark and more abstract data sources really are on our todo list, but this is a more difficult topic and cannot be handled in a single issue.

pedropenzuti commented 5 years ago

Hi everyone! I love the package, but I've recently started feeling the need to scale it. Where do the conversations stand on Spark integration? Is there a different, recommended option that I'm unaware of?

Cheers

kevinykuo commented 5 years ago

@pedropenzuti if you wanna give parsnip a shot there's sparklyr integration, see https://github.com/tidymodels/parsnip
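A minimal sketch of that route (parsnip's API has evolved, so treat the exact function names as approximate; requires a local Spark installation via sparklyr):

```r
library(sparklyr)
library(parsnip)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)

# Fit on the Spark side through parsnip's "spark" engine;
# the data never has to be pulled back into R memory.
fit <- logistic_reg() %>%
  set_engine("spark") %>%
  fit(Species ~ ., data = iris_tbl)
```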

pedropenzuti commented 5 years ago

Will take a look. Thanks Kevin!