nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0

Error message: redefine phi relevance functions #12

Open ivan-marroquin opened 4 years ago

ivan-marroquin commented 4 years ago

Hi Nick,

Many thanks for making this package available!

With my data set, and following the code example from the intermediate exercise, I ran into this error message: "redefine phi relevance function: all points are 0"

Checking the source code, I noticed that there is a safeguard:

    if all(i == 1 for i in y_phi):
        raise ValueError("redefine phi relevance function: all points are 0")

but I could not work out how this relates to my data. I am using Python 3.6.5 on a Windows machine with smogn 0.1.2.

I attached a copy of the script and input data.

Thanks for your help,

Ivan

Testing_SMOGN_package.zip

ivan-marroquin commented 4 years ago

any comments?

nickkunz commented 4 years ago

Hello,

Thank you for using SMOGN. It appears that your data does not contain the outliers needed to automatically generate regions of over-sampling. Please advise.

ivan-marroquin commented 4 years ago

Hi @nickkunz

Thanks for looking into this issue. The background data consist of zeros, while the outliers are values higher than 0.50 (see attached plot).

Hope this helps,

Ivan

input_data_smogn

ivan-marroquin commented 4 years ago

Hi @nickkunz

Hoping that you are doing well. I was wondering if you had the chance to look into this issue?

Kind regards, Ivan

Bahar1978 commented 4 years ago

Hello, thanks for SMOGN. Unfortunately, I have the same issue. Could you please advise how we should solve it?

mvirag2000 commented 3 years ago

Hi Nick,

I am also getting this error, and I have a theory. My data is very skewed: insurance data where 95% of claims are zero. I'd like SMOGN to oversample the other 5% but, I think, there are so many zero values that it doesn't identify the others as outliers. This theory is consistent with Ivan's situation. I hope this helps!

Best, Mark
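
Mark's theory can be probed without smogn itself. The sketch below assumes that the automatic relevance function is derived from box plot statistics (as far as I understand, the `rel_coef` argument is a box plot coefficient); the data are made up, and none of this calls into the package. It computes the quartiles and fences for a target that is 95% zeros:

```python
import statistics

# Hypothetical target: 95% zeros plus 5 non-zero values (made-up numbers).
y = [0.0] * 95 + [0.6, 0.7, 0.8, 0.9, 1.0]

# Quartiles and interquartile range of the target.
q1, median, q3 = statistics.quantiles(y, n=4)
iqr = q3 - q1

# Standard box plot fences; 1.5 is the usual coefficient.
coef = 1.5
lower_fence = q1 - coef * iqr
upper_fence = q3 + coef * iqr

# Values that lie above the upper fence.
beyond = [v for v in y if v > upper_fence]

print("quartiles:", q1, median, q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("values above the upper fence:", len(beyond))
```

With this much mass at zero, the quartiles, the IQR and both fences all collapse to zero, so the median and the fence coincide and the box has no spread left. Whether this degenerate box is exactly what triggers the error inside smogn is an assumption, not something confirmed in this thread.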

rkrishna116 commented 3 years ago

Hi @ivan-marroquin, I came across the same error.

Until the dev fixes this, there's a workaround you can implement.

Assuming that you work locally, go to the location where the package is installed.

For me it was "C:\Users\user_name\Anaconda3\envs\project_3\Lib\site-packages\smogn"

Open smoter.py and comment out the following lines:

    # if all(i == 0 for i in y_phi):
    #     raise ValueError("redefine phi relevance function: all points are 1")
    # if all(i == 1 for i in y_phi):
    #     raise ValueError("redefine phi relevance function: all points are 0")

Then restart the kernel and import smogn again; the error should no longer be raised.

ivan-marroquin commented 3 years ago

Hi @rkrishna116

Thanks for the workaround! I will give it a try.

I found another approach to handling minority values in continuous data: data discretization. Here is a link to learn more: https://www.includehelp.com/basics/data-discretization-in-data-mining.aspx

There are plenty of statistical approaches that can be used to estimate the optimal number of bins to discretize your continuous data. Good luck!

Ivan
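
The discretization idea above could look roughly like the following. This is a minimal sketch, not part of smogn: the helper names (`sturges_bins`, `discretize`, `oversample_by_bin`) and the data are hypothetical. Sturges' rule stands in for "a statistical approach to pick the number of bins", and plain random over-sampling with replacement stands in for whatever resampling you would apply once the target has been binned.

```python
import math
import random
from collections import defaultdict


def sturges_bins(n):
    """Number of bins by Sturges' rule: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1


def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def oversample_by_bin(rows, labels, seed=0):
    """Randomly over-sample every bin (with replacement) up to the largest bin."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))
    return balanced


# Hypothetical zero-inflated target (made-up values).
y = [0.0] * 95 + [0.6, 0.7, 0.8, 0.9, 1.0]

n_bins = sturges_bins(len(y))
labels = discretize(y, n_bins)
balanced = oversample_by_bin(y, labels)

print("bins:", n_bins, "occupied:", sorted(set(labels)))
print("size before:", len(y), "after:", len(balanced))
```

In practice the rows would be full feature records rather than bare target values. Note that duplicating records this way does not generate synthetic neighbours the way SMOGN aims to.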