mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

PipeOpSmoteNC #816

Open advieser opened 2 weeks ago

advieser commented 2 weeks ago

This PR adds PipeOpSmoteNC, which implements the Synthetic Minority Over-sampling Technique for Nominal and Continuous data (SMOTENC) using themis::smotenc().

themis::smotenc() accepts twoclass or multiclass targets and factor, ordered, numeric and integer features (despite the name "nominal and continuous"), so the PipeOp accepts them too. NAs are not permitted in any of the feature columns.
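The input constraints above can be sketched as a small base R check. This is only an illustration: `check_smotenc_input()` is a hypothetical helper, not code from this PR, and it merely reports problems rather than deciding how the PipeOp should react to them.

```r
# Hypothetical helper (not from the PR): report which feature columns fall
# outside the supported types and whether the supported ones contain NAs.
check_smotenc_input <- function(features) {
  # is.factor() is TRUE for both factor and ordered columns;
  # is.numeric() is TRUE for both numeric (double) and integer columns.
  supported <- vapply(
    features,
    function(x) is.factor(x) || is.numeric(x),
    logical(1)
  )
  list(
    unsupported = names(features)[!supported],
    has_na = any(vapply(features[supported], anyNA, logical(1)))
  )
}

# Toy feature table, made up for illustration.
features <- data.frame(
  x = c(1.5, NA, 3),
  n = 1:3,
  f = factor(c("a", "b", "a")),
  s = c("u", "v", "w"),
  stringsAsFactors = FALSE
)

res <- check_smotenc_input(features)
print(res)
```

Here the character column `s` is flagged as unsupported and the missing value in `x` is detected, which are the two situations the description calls out.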

themis::smotenc() treats integer features as if they were numeric. Since we don't want to change the feature type, we round the generated data points back to the nearest integer. As a result, the PipeOp's output can differ from what one would get by calling themis::smotenc() directly.
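A minimal sketch of the round-back step, assuming the generated rows arrive as a data frame in which the integer features have become doubles. `restore_integer_cols()` and the toy data are hypothetical and not taken from this PR; the `synthetic` table is a hand-written stand-in for the output of themis::smotenc().

```r
# Hypothetical helper (not from the PR): convert columns that were integer in
# the original data back to integer after oversampling made them doubles.
restore_integer_cols <- function(original, synthetic) {
  int_cols <- names(original)[vapply(original, is.integer, logical(1))]
  for (col in int_cols) {
    # round() goes to the nearest integer; as.integer() restores the type.
    synthetic[[col]] <- as.integer(round(synthetic[[col]]))
  }
  synthetic
}

# Toy original data with one integer and one numeric feature.
original <- data.frame(age = c(21L, 35L, 48L), income = c(1.5, 2.25, 3.0))

# Stand-in for generated rows, written by hand for illustration.
synthetic <- data.frame(age = c(27.4, 41.6), income = c(1.8, 2.7))

restored <- restore_integer_cols(original, synthetic)
str(restored)
```

Only `age` is rounded and converted; `income` stays a double. This rounding is the reason the results can deviate from a direct call to the underlying function.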

For unsupported columns, this uses the same implementation as PipeOpSmote in https://github.com/mlr-org/mlr3pipelines/pull/815. That implementation should be reviewed first, so that any changes can be applied here as well.

closes https://github.com/mlr-org/mlr3pipelines/issues/784