mlr-org / mlr3pipelines

Dataflow Programming for Machine Learning in R
https://mlr3pipelines.mlr-org.com/
GNU Lesser General Public License v3.0

Can I restore the categorical features after using po("smote")? #848

Open invain1218 opened 4 days ago

invain1218 commented 4 days ago

I am working with imbalanced data using the mlr3pipelines package and balancing it with SMOTE. Since SMOTE requires numeric features, I used po("encode") to convert the categorical variables to numeric format. However, after applying SMOTE, the balanced data contains decimal values in the encoded categorical columns.

Would it be reasonable to apply a threshold to these values (e.g., splitting them into 0 and 1 based on a cutoff) to restore them to categorical form? If not, could you suggest a better approach to handle this situation?

Thank you in advance for your advice! 😊

task <- TaskClassif$new(id = "imbalanced", backend = data, target = "target")
smote_pipeline <- po("smote", id = "smote", dup_size = 1) 
encode_pipeline <- po("encode")
encode_task <- encode_pipeline$train(list(task))[[1]]
encode_task$data() |> head()
   target    feature1    feature2 feature3.A feature3.D
   <fctr>       <num>       <num>      <num>      <num>
1:      0 -0.56047565 -0.71040656          1          0
2:      0 -0.23017749  0.25688371          0          1
3:      0  1.55870831 -0.24669188          1          0
4:      0  0.07050839 -0.34754260          0          1
5:      0  0.12928774 -0.95161857          1          0
6:      0  1.71506499 -0.04502772          0          1
balanced_task <- smote_pipeline$train(list(encode_task))[[1]]
balanced_task$data()
     target    feature1   feature2 feature3.A feature3.D
     <fctr>       <num>      <num>      <num>      <num>
  1:      0 -0.56047565 -0.7104066 1.00000000  0.0000000
  2:      0 -0.23017749  0.2568837 0.00000000  1.0000000
  3:      0  1.55870831 -0.2466919 1.00000000  0.0000000
  4:      0  0.07050839 -0.3475426 0.00000000  1.0000000
  5:      0  0.12928774 -0.9516186 1.00000000  0.0000000
 ---                                                    
106:      1  4.47894028  1.7301485 0.31211279  0.6878872
107:      1  3.01043735  2.7549804 1.00000000  0.0000000
108:      1  3.70699096  2.7648612 0.09676644  0.9032336
109:      1  3.24697515  3.0958926 1.00000000  0.0000000
110:      1  3.47680528  2.6405817 0.00000000  1.0000000

When I use SMOTE in Python on the same dataset, the balanced output keeps the original type of the categorical features. This made me wonder: would it be reasonable to apply a threshold to the decimal values (e.g., converting them back to 0 and 1) to restore the categorical variables in R? Or is there a better practice for handling categorical features with SMOTE?

advieser commented 3 days ago

Hey,

thanks for the issue! This sounds like a use case for PipeOpSmoteNC / po("smotenc"), which balances tasks with both nominal and continuous features, so pretty much your case. It should also be a lot more straightforward, since no encoding step is needed. You might need to update mlr3pipelines, as we only added it recently in version 0.7.0.
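A minimal sketch of what this could look like, assuming mlr3pipelines >= 0.7.0 and the themis package are installed. The data frame below is hypothetical (the original data is not shown in the issue); only the column names follow the output above.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)
# Hypothetical imbalanced data: 100 majority rows, 10 minority rows,
# two numeric features and one factor feature with levels "A" and "D".
data <- data.frame(
  target   = factor(rep(c("0", "1"), times = c(100, 10))),
  feature1 = c(rnorm(100), rnorm(10, mean = 3)),
  feature2 = c(rnorm(100), rnorm(10, mean = 3)),
  feature3 = factor(sample(c("A", "D"), 110, replace = TRUE))
)
task <- TaskClassif$new(id = "imbalanced", backend = data, target = "target")

# SMOTE-NC works on the factor column directly, so po("encode") is not needed.
smotenc <- po("smotenc", k = 5, over_ratio = 1)
balanced_task <- smotenc$train(list(task))[[1]]

# Inspect the class balance and confirm feature3 is still a factor.
table(balanced_task$truth())
str(balanced_task$data()$feature3)
```

Because the factor column is never converted to numbers, the synthetic rows carry valid factor levels and no thresholding step is required.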

Note, however, that the parameters of the two PipeOps have different names and slightly different interpretations, e.g. dup_size for po("smote") versus over_ratio for po("smotenc").

This is because we rely on different packages to implement the two algorithms.
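One way to see the naming difference for yourself is to list the hyperparameters of both PipeOps. This is only a sketch; it assumes an mlr3pipelines version that includes po("smotenc").

```r
library(mlr3pipelines)

# Hyperparameter names exposed by each PipeOp
smote_ids   <- po("smote")$param_set$ids()
smotenc_ids <- po("smotenc")$param_set$ids()

print(smote_ids)
print(smotenc_ids)
```

The help pages ?mlr_pipeops_smote and ?mlr_pipeops_smotenc describe how each parameter is interpreted.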

I'm not sure what they do in Python to make it work with SMOTE directly. It might be that they call SMOTE-NC under the hood, but again, I don't know.

Let me know whether this works for you. 😃

invain1218 commented 1 day ago

Oh, thank you for the detailed explanation! It seems like a better fit for my case 😊.
Thanks for pointing out the difference between dup_size and over_ratio; that clarification is super helpful! I’ll adjust accordingly.