Closed andreassot10 closed 1 month ago
Also on StackOverflow
I think this can be seen from two perspectives:

* If you specify an invalid parameter, the method **should** fail, and a backup learner might catch the error. In this case, we really need to provide an example in the mlr3gallery on how this is done.
* We as package maintainers try to robustify implementations to that degree. This would result in equal results for various values of `K` in this instance.

In the latter case, I think this should be moved to `mlr3pipelines`, as this basically is a problem of how we implemented the particular `PipeOp`. In general, though, we might also want to define how we want to deal with such situations where we'd basically have to manipulate hyperparameters so downstream tasks do not fail.
These are good points. I agree that the method should fail when the user specifies an upper bound for `K` that is greater than or equal to the number of records in the (training) data. But I think there is a certain peculiarity when it comes to SMOTE: even when the supplied upper bound for `K` is smaller than the number of records, the method may still fail. That's because we may want to tune the pipeline with e.g. CV, where SMOTE is applied to the data subsets (e.g. the CV folds). The number of records in these subsets may or may not be greater than `K`. Each and every time the user changes the train/test split and/or the number of CV folds, they also have to readjust the upper bound for `K`.
It's a tricky one, because there is a certain degree of validity in both points you are making, while it's still not easy to decide which one to go for.
I've implemented a fix here with a `trafo` that forces `K` to always be smaller than the number of records in the data subsets. It does come with drawbacks, though, as the value of `K` is likely to be the same in different runs. Also, note that it depends on CV instantiation, so `cv$instantiate(task)` will have to precede certain commands.

I hope this helps.
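A minimal sketch of the clamping idea behind such a trafo (the helper `clamp_k` and the wiring shown in comments are illustrative, not the actual fix; it assumes the training-set size is known, e.g. after `cv$instantiate(task)`):

```r
# Hypothetical clamping rule: keep K strictly below the number of rows
# available in a training subset. Pure function, so it is easy to test.
clamp_k <- function(k, n_train) {
  min(k, n_train - 1L)
}

# Sketch of how this could plug into a paradox search-space trafo
# (names are assumptions; `x$K` is the sampled hyperparameter value):
# search_space$trafo <- function(x, param_set) {
#   x$K <- clamp_k(x$K, n_train)
#   x
# }

clamp_k(10L, 50L)  # K fits within the subset: stays 10
clamp_k(60L, 50L)  # K too large: clamped to 49
```

The drawback mentioned above follows directly: once the sampled `K` exceeds the subset size, every such configuration collapses to the same clamped value.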
1) Ok, a couple of comments. I think this is somewhat of a "fringe issue"; sorry @andreassot10. Currently, you might have to live with this not working out ideally for you.
2) The value of `K` refers not to what is in the task as a whole, but to what is seen inside the training set. That's just how it is.
But what you basically want is for `K` to change depending on the size of the training set, correct?
We can either implement that in SMOTE, as robust behavior of the algorithm (then it's an `mlr3pipelines` issue), or we can make the trafos better.
@pfistfl what do you think?
Thanks @berndbischl. My answers to your comments in reverse order:
2. Exactly. That's where SMOTE is likely to fail. So I'd indeed like the upper bound of `K` to change depending on the size of the training set. So if you're running a k-fold cross-validation, `K` should be smaller than the number of rows of the dataset comprising the k-1 training folds.
1. Can definitely live with it for the time being. If I come up with a better fix that ensures the values of `K` aren't duplicated every now and then, I'll make it public.
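For concreteness, the bound described above can be sketched as a back-of-the-envelope calculation (the helper name is hypothetical; exact fold sizes depend on how the resampling distributes remainder rows):

```r
# Rough upper bound for K under k-fold CV: the training split holds
# about (k-1)/k of the rows, and SMOTE needs K strictly below that count.
max_k_for_cv <- function(n_rows, folds) {
  n_train <- floor(n_rows * (folds - 1) / folds)
  n_train - 1
}

max_k_for_cv(100, 5)   # training split has ~80 rows, so K can be at most 79
max_k_for_cv(100, 10)  # training split has ~90 rows, so K can be at most 89
```

This also shows why changing the number of folds forces the user to readjust the upper bound for `K` by hand.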
Thanks
This should be possible now with callbacks.
Hello,
I'm having trouble with the trafo function for `SMOTE {smotefamily}`'s `K` parameter. In particular, when the number of nearest neighbours `K` is greater than or equal to the sample size, an error is returned (`warning("k should be less than sample size!")`) and the tuning process is terminated. The user cannot control `K` to be smaller than the sample size during the internal resampling process. This would have to be controlled internally so that if, for instance, `trafo_K = 2 ^ K >= sample_size` for some value of `K`, then, say, `trafo_K = sample_size - 1`.

I was wondering if there's a solution to this or if one is already on its way?
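The internal control described above could look something like the following sketch (`trafo_K` and `sample_size` follow the names used in this issue; the function itself is illustrative, not part of any package):

```r
# Proposed rule from this issue: tune K on a log2 scale (trafo_K = 2^K),
# but fall back to sample_size - 1 whenever the transformed value would
# reach or exceed the sample size, instead of letting SMOTE error out.
safe_trafo_K <- function(K, sample_size) {
  trafo_K <- 2^K
  if (trafo_K >= sample_size) sample_size - 1 else trafo_K
}

safe_trafo_K(3, 100)  # 2^3 = 8, below 100: returns 8
safe_trafo_K(7, 100)  # 2^7 = 128 >= 100: falls back to 99
```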
Many thanks.