mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

makeTask function that guesses task type from output #1176

Closed berndbischl closed 8 years ago

berndbischl commented 8 years ago

in some applications it would be nice to have a generic makeTask function (better name?) that guesses whether the task is classif, regr, and so on from the data type of the target(s).

this would save annoying if-cascades like this one:

  y = data[, tn]
  # braces keep the else branch attached to the if when run at top level
  if (is.numeric(y)) {
    makeRegrTask(id = id, data = data, target = tn)
  } else if (is.factor(y)) {
    makeClassifTask(id = id, data = data, target = tn)
  }

jakob-r commented 8 years ago

If this is going to be your code, I can already see it failing. We have to specify in which cases we are sure that it is regression and in which it is classification, and I propose to stop with an error otherwise.

  # Classification?
  y = c(0, 1)
  # Regression?
  y = c(0.1, 0.2)

berndbischl commented 8 years ago

i don't get your point. the function would be a little helper that obviously cannot work magic. there would be clearly documented rules for how it works, so it cannot "fail" in that sense.

possible rules could be:

a) if y is numeric (including integer), the task is always "regr". in that case both examples in your post are regression.

b) if y is numeric (including integer), the task is always "regr", UNLESS it has fewer than k distinct values. then it is classification.
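The two candidate rules can be sketched in base R as follows. This is only an illustration under stated assumptions: `guessTaskType`, its `rule` argument, and the default `k = 5L` are hypothetical names and values, not part of mlr, and the function returns a type string instead of building a task.

```r
# Hypothetical helper (not part of mlr): returns the guessed task type as a string.
# `rule` selects between the two proposals; `k` is an assumed default threshold.
guessTaskType = function(y, rule = c("a", "b"), k = 5L) {
  rule = match.arg(rule)
  if (is.factor(y)) return("classif")
  if (is.numeric(y)) {
    # rule b): a numeric target with fewer than k distinct values counts as classification
    if (rule == "b" && length(unique(y)) < k) return("classif")
    return("regr")
  }
  stop("cannot guess task type for target of class ", class(y)[1L])
}

guessTaskType(c(0, 1), rule = "a")      # "regr"
guessTaskType(c(0.1, 0.2), rule = "a")  # "regr"
guessTaskType(c(0, 1), rule = "b")      # "classif"
guessTaskType(c(0.1, 0.2), rule = "b")  # "classif", because 2 < k
```

Note that under rule b) a short numeric vector such as `c(0.1, 0.2)` is also treated as classification, which is the kind of surprise the threshold can cause.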

the point is to make life more convenient for the user in some cases. if you want to have full control, you can always use the already existing functions.

if you have a numeric / int with, e.g., 10 distinct values, conceptually it is unclear whether this is classif or regr. the user HAS to tell you.

now you can take 2 stances

a) say what i am proposing here is unnecessary, and there can never be a "perfect" solution.

b) you see the point that it sometimes makes the user's life easier. then come up with a good rule system for how the task type is auto-detected.

jakob-r commented 8 years ago

I was just saying that we need well-defined restrictions. The problem is always that the user might not know the k and is suddenly surprised to get a classification task.

But my proposal is to be strict.
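One possible reading of "strict" is sketched below in base R. It is an assumption about what counts as unambiguous, not something settled in this thread: factors are classification, numeric targets with at least one non-integer value are regression, and everything else stops with an error. `guessTaskTypeStrict` is a hypothetical name, not an mlr function.

```r
# Hypothetical strict variant (not part of mlr): guess only in cases assumed
# to be unambiguous and stop with an error otherwise.
guessTaskTypeStrict = function(y) {
  if (is.factor(y)) return("classif")
  if (is.numeric(y)) {
    # at least one non-integer value: treated as a regression target
    if (any(y != round(y), na.rm = TRUE)) return("regr")
    stop("integer-valued target is ambiguous; please specify the task type")
  }
  stop("cannot guess task type for target of class ", class(y)[1L])
}

guessTaskTypeStrict(c(0.1, 0.2))            # "regr"
guessTaskTypeStrict(factor(c("a", "b")))    # "classif"
res = try(guessTaskTypeStrict(c(0, 1)), silent = TRUE)
inherits(res, "try-error")                  # TRUE: the ambiguous case is rejected
```

With this variant there is no k to know about, at the cost of rejecting integer-valued targets that the user would have to declare explicitly anyway.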

berndbischl commented 8 years ago

"I was just saying that we need well defined restrictions."

sure. and my code above was also just a stupid example, not a definitive answer.

but do you think that such a function would be helpful in the end? or shall i simply close this here? i really don't know.

anyway more comments: if y is multiple logical columns, you can set it to multilabel. if no y is set, clustering.
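Those two extra cases could be folded into the same kind of helper. The sketch below is base R only and rests on assumptions: `guessTaskTypeFromData` is a hypothetical name, it takes a data.frame plus a character vector of target column names, and it ignores task types not mentioned in this thread.

```r
# Hypothetical sketch (not part of mlr): guess the task type from a data.frame
# and a character vector of target column names.
guessTaskTypeFromData = function(data, target = character(0)) {
  # no target given: clustering
  if (length(target) == 0L) return("cluster")
  y = data[, target, drop = FALSE]
  if (length(target) > 1L) {
    # several targets are only accepted if all of them are logical columns
    if (all(vapply(y, is.logical, logical(1L)))) return("multilabel")
    stop("multiple targets must all be logical columns")
  }
  y = y[[1L]]
  if (is.factor(y)) return("classif")
  if (is.numeric(y)) return("regr")
  stop("cannot guess task type for target of class ", class(y)[1L])
}

d = data.frame(
  x  = c(1.5, 2.5, 3.5),
  l1 = c(TRUE, FALSE, TRUE),
  l2 = c(FALSE, FALSE, TRUE)
)
guessTaskTypeFromData(d)                          # "cluster"
guessTaskTypeFromData(d, target = c("l1", "l2"))  # "multilabel"
guessTaskTypeFromData(d, target = "x")            # "regr"
```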

i don't know. is this really useful?

masongallo commented 8 years ago

"i dont know. is this really useful?"

My opinion is that while this may shorten code for people who know what they're doing with the mlr API, it is also dangerous and can lead to problems downstream, especially for users new to our API. I prefer forcing the user to specify the task type, to prevent surprises and unintentional bugs.

larskotthoff commented 8 years ago

I agree with Mason. It would save only a few characters, but potentially do weird and non-obvious things, especially if you use the same script with different data.

berndbischl commented 8 years ago

thanks to everybody for the feedback. i will close this now, as the benefits do not seem obvious. we can reopen it if someone sees clear advantages.