Open ivan-marroquin opened 4 years ago
any comments?
Hello,
Thank you for using SMOGN. It appears that your data does not contain the outliers that SMOGN needs to automatically generate regions for over-sampling. Please advise.
Hi @nickkunz
Thanks for looking into this issue. The background data consist of zeros, while the outliers are values higher than 0.50 (see attached plot).
Hope this helps,
Ivan
Hi @nickkunz
I hope you are doing well. I was wondering whether you have had a chance to look into this issue?
Kind regards, Ivan
Hello, thanks for SMOGN. Unfortunately, I have the same issue. Could you please advise us on how to solve it?
Hi Nick,
I am also getting this error, and I have a theory. My data is very skewed: insurance data where 95% of claims are zero. I'd like SMOGN to oversample the other 5% but, I think, there are so many zero values that it doesn't identify the others as outliers. This theory is consistent with Ivan's situation. I hope this helps!
Best, Mark
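For anyone who wants to check this theory on their own target column, here is a minimal diagnostic sketch. It only computes the usual box-plot quantities (quartiles, IQR, and fences at 1.5 x IQR) with NumPy. The function name `tukey_summary` and the sample data are made up for illustration, and whether SMOGN's automatic relevance method behaves the same way internally is an assumption, not something confirmed in this thread.

```python
import numpy as np

def tukey_summary(y, coef=1.5):
    """Return box-plot statistics for a 1-D target (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - coef * iqr, q3 + coef * iqr
    return {
        "zero_fraction": float(np.mean(y == 0)),
        "q1": float(q1),
        "median": float(med),
        "q3": float(q3),
        "iqr": float(iqr),
        "lower_fence": float(lower),
        "upper_fence": float(upper),
        # Number of points strictly outside the fences
        "n_outside": int(np.sum((y < lower) | (y > upper))),
    }

# Made-up sample: 95 zeros and 5 positive values
y = np.concatenate([np.zeros(95), [0.6, 0.7, 0.8, 0.9, 1.0]])
summary = tukey_summary(y)
print(summary)
```

In this toy sample the quartiles are all zero, so the IQR is zero and both fences collapse onto the median. If your own data shows the same pattern, that degenerate box plot is a plausible place to look for the cause.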
Hi @ivan-marroquin, I came across the same error.
Until the dev fixes this, there's a workaround you can implement.
Assuming that you work locally, go to the location where the package is installed.
For me it was "C:\Users\user_name\Anaconda3\envs\project_3\Lib\site-packages\smogn"
Open smoter.py and comment out the following lines (both the if statements and the raise statements, since an if with only a comment as its body is a syntax error):
# if all(i == 0 for i in y_phi):
#     raise ValueError("redefine phi relevance function: all points are 1")
# if all(i == 1 for i in y_phi):
#     raise ValueError("redefine phi relevance function: all points are 0")
Then restart the kernel, import smogn again, and this issue should be fixed.
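A possible alternative that avoids editing the installed package is to define the relevance function by hand instead of relying on the automatic one. The sketch below assumes the `rel_method="manual"` and `rel_ctrl_pts_rg` arguments of `smogn.smoter` as described in the package README; please verify them against your installed version. The helper names `manual_ctrl_pts` and `oversample` are hypothetical, and the 0.0 / 0.5 values are taken from Ivan's description of his data, so adjust them to yours. I have not confirmed that this avoids the error in every case.

```python
def manual_ctrl_pts(low, high):
    """Build relevance control points as rows of [target value, relevance, slope]."""
    if not low < high:
        raise ValueError("low must be smaller than high")
    return [
        [low, 0, 0],   # background values: relevance 0 (not over-sampled)
        [high, 1, 0],  # values of interest: relevance 1 (over-sampled)
    ]

def oversample(df, target, low=0.0, high=0.5):
    """Call SMOGN with a hand-made relevance function (assumed API, see note above)."""
    import smogn  # imported here so manual_ctrl_pts works without smogn installed

    return smogn.smoter(
        data=df,
        y=target,
        rel_method="manual",
        rel_ctrl_pts_rg=manual_ctrl_pts(low, high),
    )

ctrl_pts = manual_ctrl_pts(0.0, 0.5)
print(ctrl_pts)
```

Usage would be something like `oversample(my_dataframe, "my_target_column")`, where both names are placeholders for your own data.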
Hi @rkrishna116
Thanks for the workaround! I will give it a try.
I found another approach to the need for minority values in continuous data: "data discretization". You can read more about it here: https://www.includehelp.com/basics/data-discretization-in-data-mining.aspx
There are plenty of statistical approaches that can be used to estimate the optimal number of bins to discretize your continuous data. Good luck!
Ivan
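As a rough illustration of the discretization idea, here is a minimal sketch that bins a continuous target with NumPy. Using Sturges' rule for the number of bins is just one of the statistical choices mentioned above, picked here as an assumption, and the sample data is made up.

```python
import numpy as np

# Made-up sample: 95 zeros and 5 positive values
y = np.concatenate([np.zeros(95), [0.6, 0.7, 0.8, 0.9, 1.0]])

# Estimate bin edges with Sturges' rule (other rules are available in NumPy)
edges = np.histogram_bin_edges(y, bins="sturges")

# Assign each value a bin label in 0 .. n_bins - 1 using the interior edges
labels = np.digitize(y, edges[1:-1])

# Count how many samples fall into each bin
counts = np.bincount(labels, minlength=len(edges) - 1)
print(edges)
print(counts)
```

Once the target is discretized, the sparsely populated bins can be treated as minority classes, which opens the door to class-based over-sampling methods instead of regression-based ones.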
Hi Nick,
Many thanks for making this package available!
With my data set, and following the code example for the intermediate exercise, I ran into this error message: redefine phi relevance function: all points are 0
Checking the source code, I noticed that there is a safeguard:
if all(i == 1 for i in y_phi): raise ValueError("redefine phi relevance function: all points are 0")
but I could not work out how this relates to my data. I am using Python 3.6.5 on a Windows machine with smogn 0.1.2.
I attached a copy of the script and input data.
Thanks for your help,
Ivan
Testing_SMOGN_package.zip
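To make the safeguard quoted in this thread easier to reason about, here is a minimal standalone sketch of the two checks. The conditions and messages are copied from the lines quoted here; the function names `check_phi` and `triggers` are hypothetical, and how SMOGN actually computes `y_phi` from the data is not shown.

```python
def check_phi(y_phi):
    """Standalone copy of the two checks quoted from smoter.py."""
    if all(i == 0 for i in y_phi):
        raise ValueError("redefine phi relevance function: all points are 1")
    if all(i == 1 for i in y_phi):
        raise ValueError("redefine phi relevance function: all points are 0")

def triggers(y_phi):
    """Return the error message if a check fires, otherwise None."""
    try:
        check_phi(y_phi)
    except ValueError as err:
        return str(err)
    return None

print(triggers([1, 1, 1]))      # every relevance value equals 1
print(triggers([0, 0.3, 1]))    # mixed relevance values
```

So the "all points are 0" message is raised exactly when every relevance value equals 1, meaning the relevance function did not separate any observations from the rest.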