Open invain1218 opened 4 days ago
Hey,
thanks for the issue! This sounds like a use case for PipeOpSmoteNC / po("smotenc"), which is for balancing tasks with nominal and continuous data, so pretty much your case. It should also be a lot more straightforward. You might need to update mlr3pipelines, as we only added it recently in version 0.7.0.
Note, however, that the parameters of the two PipeOps have different names and slightly different interpretation:
- dup_size in PipeOpSmote: Desired ratio of minority to majority instances.
- over_ratio in PipeOpSmoteNC: Desired ratio of the majority to minority instances.
This is due to the fact that we rely on different packages for the implementation of the two algorithms.
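For reference, here is a minimal sketch of how po("smotenc") could be applied. The toy data and the column names (y, x_num, x_cat) are made up for illustration, and it assumes that mlr3, mlr3pipelines (>= 0.7.0) and themis (which, as far as I know, provides the underlying SMOTENC implementation) are installed:

```r
library(mlr3)
library(mlr3pipelines)

# Made-up imbalanced data: 40 rows of class "a", 10 rows of class "b",
# with one numeric and one factor feature.
set.seed(1)
dat = data.frame(
  y = factor(c(rep("a", 40), rep("b", 10))),
  x_num = rnorm(50),
  x_cat = factor(sample(c("u", "v", "w"), 50, replace = TRUE))
)
task = as_task_classif(dat, target = "y")

# over_ratio = 1 asks for the minority class to be sampled up
# to the size of the majority class.
pop = po("smotenc", over_ratio = 1)
task_bal = pop$train(list(task))[[1]]

# Class counts after balancing, and a check that the factor
# feature is still a factor (no encoding step was needed).
print(table(task_bal$truth()))
print(class(task_bal$data()$x_cat))
```

Since no po("encode") step is involved, the categorical feature should stay a factor in the balanced task.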
I'm not sure what they do in Python to make it work with Smote directly. It might be that they call SmoteNC under the hood, but again, I don't know.
Let me know whether this works for you. 😃
Oh, thank you for the detailed explanation! It seems like a better fit for my case 😊.
Thanks for pointing out the difference between dup_size and over_ratio—that clarification is super helpful! I’ll adjust accordingly.
I am currently working with imbalanced data using the mlr3pipelines package and applying the SMOTE method for balancing. Since SMOTE requires numeric features, I used po("encode") to convert categorical variables into numeric format. However, I noticed that the final balanced data includes decimal values for the categorical features after applying SMOTE.
Would it be reasonable to apply a threshold to these values (e.g., splitting them into 0 and 1 based on a cutoff) to restore them to categorical form? If not, could you suggest a better approach to handle this situation?
Thank you in advance for your advice! 😊
When I use SMOTE in Python on the same dataset, the output balanced data retains the original raw type for categorical features. This has led me to wonder: would it be reasonable to apply a threshold to the decimal values (e.g., converting them back to 0 and 1) to restore the categorical variables in R? Or is there a better practice for handling such cases when working with categorical features in SMOTE?
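As a small illustration of where the decimal values come from: SMOTE builds each synthetic row by interpolating between a minority instance and one of its nearest neighbours, so one-hot encoded columns get interpolated along with everything else. The sketch below uses base R only. The two rows and the column names are made up, and it mimics a single interpolation step by hand rather than calling any SMOTE implementation:

```r
# Two made-up minority rows; a 3-level factor has been one-hot encoded
# into cat_a, cat_b, cat_c.
x_i = c(age = 30, cat_a = 1, cat_b = 0, cat_c = 0)
x_j = c(age = 40, cat_a = 0, cat_b = 1, cat_c = 0)

# One SMOTE-style interpolation step: x_new = x_i + u * (x_j - x_i),
# where u would normally be drawn uniformly from [0, 1].
u = 0.5
x_new = x_i + u * (x_j - x_i)
print(x_new)
# age = 35, cat_a = 0.5, cat_b = 0.5, cat_c = 0

# Thresholding the encoded columns does not always restore a valid level:
cats = c("cat_a", "cat_b", "cat_c")
print(as.integer(x_new[cats] >= 0.5))  # 1 1 0: two levels at once
print(as.integer(x_new[cats] > 0.5))   # 0 0 0: no level at all
```

So a fixed cutoff can produce rows that belong to several levels or to none, which is one reason a method that handles factors directly, such as po("smotenc"), tends to be the cleaner route.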