nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0

SMOGN is creating a new class for target #38

Open purp172 opened 2 years ago

purp172 commented 2 years ago

Hey! Any idea why the algorithm is creating a new class (value) for my target? I'm analyzing the Room_Occupancy_Dataset from Kaggle. In this dataset the target has only four occupancy values (0, 1, 2, or 3 people in the room), but the model is expected to be able to predict other cases with more than 3 people in the room. SMOGN is not balancing the data correctly: the majority class (0) keeps the same number of rows, and the minority classes (1, 2, 3) are not over-sampled. On top of that, it creates an extra value (4). I don't know if this is a bug, but I hope you can help me fix it. This is my 2D array of relevance control points, followed by the SMOGN call:

rg_mtrx = [

    [0, 0, 0],  ## under-sample ("majority")
    [1, 1, 0],  ## over-sample ("minority")
    [2, 1, 0],  ## over-sample ("minority")
    [3, 1, 0],  ## over-sample ("minority")
]

import smogn

## conduct smogn
## (df is the Room_Occupancy_Dataset loaded as a pandas dataframe)
balanced_smogn = smogn.smoter(

    ## main arguments
    data = df,                   ## pandas dataframe
    y = 'Room_Occupancy_Count',  ## string ('header name')
    k = 5,                       ## positive integer (k < n)
    pert = 0.02,                 ## real number (0 < R < 1)
    samp_method = 'extreme',     ## string ('balance' or 'extreme')
    drop_na_col = False,         ## boolean (True or False)
    drop_na_row = False,         ## boolean (True or False)
    replace = True,              ## boolean (True or False)

    ## phi relevance arguments
    rel_thres = 0.50,            ## real number (0 < R < 1)
    rel_method = 'manual',       ## string ('auto' or 'manual')
    # rel_xtrm_type = 'both',    ## unused (rel_method = 'manual')
    # rel_coef = 1.50,           ## unused (rel_method = 'manual')
    rel_ctrl_pts_rg = rg_mtrx    ## 2d array (format: [x, y])
)
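
To make the symptom described above easier to verify, one option is to compare the target's value counts before and after resampling. The following is only an illustrative sketch: `compare_target_counts` is a hypothetical helper (it is not part of smogn), and the two small frames are made-up stand-ins for `df` and `balanced_smogn`, not rows from the actual dataset.

```python
import pandas as pd


def compare_target_counts(before: pd.DataFrame, after: pd.DataFrame, y: str) -> pd.DataFrame:
    """Return per-value row counts of column `y` before and after resampling."""
    counts = pd.concat(
        [before[y].value_counts(), after[y].value_counts()],
        axis=1,
        keys=["before", "after"],
    )
    # A value present in only one frame shows up as NaN; count it as 0 rows.
    counts = counts.fillna(0).astype(int).sort_index()
    # Flag target values that did not exist before resampling.
    counts["new_value"] = counts["before"] == 0
    return counts


# Made-up stand-ins for `df` and `balanced_smogn` (NOT real data).
toy_before = pd.DataFrame({"Room_Occupancy_Count": [0, 0, 0, 0, 1, 2, 3]})
toy_after = pd.DataFrame({"Room_Occupancy_Count": [0, 0, 0, 0, 1, 2, 3, 4]})

report = compare_target_counts(toy_before, toy_after, "Room_Occupancy_Count")
print(report)
```

With the real frames, the call would be `compare_target_counts(df, balanced_smogn, 'Room_Occupancy_Count')`. Rows where `before` equals `after` were neither over- nor under-sampled, and rows flagged `new_value` are target values that resampling introduced.
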
nickkunz commented 1 year ago

Hello @Diogo-da-Silva-Rebelo, SMOGN was developed for regression. It seems like your problem is a classification one? If that is the case, then SMOGN would not be useful. You may want to see if SMOTE is more appropriate. Thank you.

purp172 commented 1 year ago

Hello, @nickkunz! Thank you for responding. I don't think that's the case: I want to predict the number of people in the room, not a specific class (that is, not merely whether the room is occupied). In fact, the target can take many values, not just a restricted set. However, the target values must be integers, because we can't have 1.2 persons in the room :) So it is a regression problem. When I said that the dataset only has four values, I did not mean that other values can't appear, for instance in my test dataset. The algorithm is leaving all rows with target = 0 untouched, even though that is the majority value. And it's not balancing, since the counts for the other values remain intact. What are your thoughts?
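
Since the target here must stay an integer count, one possible workaround is to post-process the resampled frame by rounding the target and, optionally, clipping it to a chosen range. This is a minimal sketch rather than a fix inside smogn. It assumes the extra value (4) comes from a synthetic target landing outside the observed range, which this thread does not confirm. `snap_target_to_counts` is a hypothetical helper name, and the toy frame is made up for illustration.

```python
from typing import Optional

import pandas as pd


def snap_target_to_counts(
    resampled: pd.DataFrame,
    y: str,
    lower: Optional[int] = 0,
    upper: Optional[int] = None,
) -> pd.DataFrame:
    """Round column `y` to whole numbers and clip it to [lower, upper]."""
    out = resampled.copy()  # leave the caller's frame untouched
    out[y] = out[y].round().clip(lower=lower, upper=upper).astype(int)
    return out


# Made-up stand-in for a resampled frame (NOT real data).
toy = pd.DataFrame({"Room_Occupancy_Count": [0.0, 1.2, 2.6, 3.4, 4.0]})

snapped = snap_target_to_counts(toy, "Room_Occupancy_Count", lower=0, upper=3)
print(snapped["Room_Occupancy_Count"].tolist())  # → [0, 1, 3, 3, 3]
```

The upper bound is optional on purpose: because rooms with more than 3 people are plausible, leaving `upper=None` keeps synthetic values such as 4 and only enforces non-negative integers.
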