mlr-org / mlrCPO

Composable Preprocessing Operators for MLR

makeCPO: Number of rows of numeric data returned by t-sne did not match input #75

Open ghost opened 5 years ago

ghost commented 5 years ago

cpoTsne = makeCPOExtendedTrafo("t-sne",  # nolint
  pSS(rank: integer[1, ]),
  dataformat = "numeric",
  cpo.trafo = function(data, target, rank) {
    outTsne = Rtsne(as.matrix(data), dims = rank, perplexity = 10, max_iter = 100)
    control = outTsne$Y
  },
  cpo.retrafo = function(data, control, rank) {
    control
  })
lrn = cpoTsne(rank = 2) %>>% makeLearner("classif.ksvm")
resample(lrn, task, resampling = outer_loop, measures = list(mmce), show.info = FALSE)

Hello, if I execute the code above, I get the following error message:

Error in recombineLL(df, newdata, targetcols, strict.factors, subset.index, :
  Number of rows of numeric data returned by t-sne did not match input
  CPO must not change row number.

This error message only appears if I use makeLearner() and resample(). But the following command works:

data %>>% cpoTsne(rank = 2)

  classif           V1           V2
1       0 -2.055749824 -1.801610596
2       1 -0.469646936  3.347391844
3       1 -0.194586726  0.057422613
4       0  1.070363088  3.380600350
5       1 -0.567965508  3.096630889
...

The number of rows is the same as in data. Where is the problem? It must be in cpo.retrafo. Thanks

mb706 commented 5 years ago

The problem is that cpo.retrafo must transform the validation data during resampling. The CPO is run twice: once on the training data (cpo.trafo) and once on the validation data (cpo.retrafo). Your implementation simply returns the transformed training data in the "retrafo" phase and completely ignores the incoming validation data. The CPO framework notices that the number of rows does not match, but the problem goes deeper: there is no straightforward way to get a corresponding representation of prediction data under a t-SNE transformation.

t-SNE does not seem well suited for preprocessing as part of a machine learning pipeline: it is nonparametric, so it provides no mapping that could be applied to new data, and a model trained on the transformed training data would not be able to handle prediction data.
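
To illustrate the trafo/retrafo split, here is a minimal sketch in base R, without mlrCPO itself. It swaps in PCA for t-SNE purely because PCA is parametric and so has parameters that can be stored and reapplied; this is an assumption made for illustration, not a claim that PCA is the right substitute for your data. The function names pcaTrafo, pcaRetrafo and brokenRetrafo are hypothetical and not part of any package, and the iris split only stands in for a training/validation fold.

```r
# Stand-in for one resampling fold: 100 training rows, 50 validation rows.
train <- as.matrix(iris[1:100, 1:4])
valid <- as.matrix(iris[101:150, 1:4])

# "trafo" step (hypothetical helper): fit on the training data and return
# both the transformed training data and a control object with the parameters.
pcaTrafo <- function(data, rank) {
  pcr <- prcomp(data, center = TRUE, scale. = TRUE)
  control <- list(
    rotation = pcr$rotation[, seq_len(rank), drop = FALSE],
    center = pcr$center,
    scale = pcr$scale
  )
  list(data = pcr$x[, seq_len(rank), drop = FALSE], control = control)
}

# "retrafo" step (hypothetical helper): apply the stored parameters to the
# incoming data, so the output has one row per incoming row.
pcaRetrafo <- function(data, control) {
  scaled <- scale(data, center = control$center, scale = control$scale)
  scaled %*% control$rotation
}

# Pattern from the issue: ignore the incoming data, return the stored output.
brokenRetrafo <- function(data, control) control

fit <- pcaTrafo(train, rank = 2)
validOut <- pcaRetrafo(valid, fit$control)
brokenOut <- brokenRetrafo(valid, fit$data)

nrow(validOut) == nrow(valid)   # TRUE: 50 rows in, 50 rows out
nrow(brokenOut) == nrow(valid)  # FALSE: 100 training rows returned for 50 inputs
```

In a real CPO, the bodies of these two helpers would go into cpo.trafo and cpo.retrafo respectively. The row mismatch in the last line is the same condition that the error message reports.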