mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

New Down-Sampling PipeOps (Tomek, Nearmiss) based on `themis` #817

Open advieser opened 2 weeks ago

advieser commented 2 weeks ago

This implements two new PipeOps for down-sampling imbalanced data, Tomek and Nearmiss, by calling the corresponding themis functions.

Currently, both PipeOps ignore stratification completely.

I'm looking for feedback on how I filter the task based on the themis result. Since the result is a data.table, called dt, I currently take the row names of that result:

keep = as.integer(row.names(dt))
task$filter(keep)

This seems a bit clunky to me. An alternative I thought of would be

keep = as.integer(row.names(fintersect(task$data(), dt)))

which I'd expect to be more robust but also computationally more intensive (don't know how efficient fintersect is).
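To make the comparison concrete, here is a minimal sketch using only data.table and toy data. `full` is a hypothetical stand-in for `task$data()`, and `dt` is built by plain subsetting rather than by an actual themis call, so this shows only how row names and joins behave on a data.table, not what themis itself returns. The join with `which = TRUE` is a further alternative for discussion, not something this PR implements.

```r
library(data.table)

# Hypothetical stand-in for task$data(): 'y' is the target, 'x1'/'x2' are features.
full = data.table(
  y  = factor(c("a", "a", "a", "b", "b")),
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(10, 20, 30, 40, 50)
)

# Stand-in for the down-sampled result: assume rows 1, 3 and 5 survive.
dt = full[c(1, 3, 5)]

# A data.table does not keep row names, so they restart at 1 after subsetting
# and no longer point at the original rows.
positions_from_rownames = as.integer(row.names(dt))

# Join on all columns; which = TRUE returns, for each row of 'dt',
# the position of the matching row in 'full'.
positions_from_join = full[dt, on = names(dt), which = TRUE]

print(positions_from_rownames)
print(positions_from_join)
```

If the themis result behaves like `dt` here, the row-name approach would keep the first rows of the task rather than the selected ones, so it seems worth checking what row names themis actually returns. The join has its own caveats: duplicated rows in `full` produce multiple matches, and the positions presumably still need to be mapped to row IDs (for example via `task$row_ids`) before being passed to `task$filter()`.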

Partially addresses https://github.com/mlr-org/mlr3pipelines/issues/790.