This implements two new PipeOps for down-sampling imbalanced data by calling `themis` functions:
- `PipeOpTomek`: Removes Tomek links, i.e. pairs of observations that are nearest neighbors and of different classes. Note that this is only one possible variant of Tomek link removal, the one used for data cleaning. There is also a variant for balancing data, in which only the majority class member of each Tomek link is removed instead of both observations. However, the cleaning variant is the only one currently implemented by `themis`.
- `PipeOpNearmiss`: Removes instances of the non-minority classes based on the NEARMISS algorithm, i.e. the instances that have the smallest mean distance to the closest instances of other classes. This is, again, only one possible implementation, but the only one in `themis`.
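
To illustrate the difference between the two Tomek link variants mentioned above, here is a minimal base R sketch on hypothetical toy data. It does not call `themis` and is not how either PipeOp is implemented. All variable names are made up for illustration, and it assumes a Tomek link is a pair of mutual nearest neighbors (Euclidean distance) with different class labels:

```r
# Hypothetical toy data: two numeric features, class "a" is the majority.
x <- matrix(c(0.0, 0.0,
              0.1, 0.0,
              1.0, 1.0,
              1.1, 1.0,
              3.0, 3.0,
              3.1, 3.0,
              5.0, 5.0),
            ncol = 2, byrow = TRUE)
y <- c("a", "b", "a", "a", "a", "a", "b")

# Index of each observation's nearest neighbor (Euclidean distance).
d <- as.matrix(dist(x))
diag(d) <- Inf
nn <- unname(apply(d, 1, which.min))

# Assumed definition: mutual nearest neighbors with differing classes.
idx <- seq_len(nrow(x))
is_link <- nn[nn] == idx & y != y[nn]

# Cleaning variant: drop both members of every Tomek link.
keep_clean <- which(!is_link)

# Balancing variant: drop only the majority class member of each link.
majority <- names(which.max(table(y)))
keep_balance <- which(!(is_link & y == majority))

print(keep_clean)
print(keep_balance)
```

In this toy data only observations 1 and 2 form a Tomek link, so the cleaning variant drops both of them, while the balancing variant drops only observation 1 and keeps the minority observation 2.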
The documentation in `themis` seems to contain a few errors (probably due to being copied from another function).
As of right now, these two PipeOps ignore stratification completely.
I'm looking for some feedback about the way I filter the task based on the `themis` result. As the result is a `data.table`, called `dt`, I currently take the rownames of that result:
This seems a bit clunky to me. An alternative I thought of would be
which I'd expect to be more robust but also computationally more intensive (I don't know how efficient `fintersect` is).

Partially addresses https://github.com/mlr-org/mlr3pipelines/issues/790.