nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0

redefine phi relevance function: all points are 0 #2

Open sam-redbox opened 4 years ago

sam-redbox commented 4 years ago

/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/smogn/phi.py:81: RuntimeWarning: divide by zero encountered in double_scalars
  delta.append((y_rel[i + 1] - y_rel[i]) / h[i])
redefine phi relevance function: all points are 0
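The expression in the warning divides by h[i]. Below is a minimal sketch of how that division can reach zero. It assumes, without confirmation from the thread, that h holds the spacing between consecutive control points of the relevance function, so two control points with the same y value give h[i] == 0. The function name secant_slopes is hypothetical and is not part of smogn.

```python
def secant_slopes(x, y_rel):
    """Slopes between consecutive control points (x[i], y_rel[i]).

    Mirrors the expression from the warning, but fails loudly
    instead of dividing by a zero spacing.
    """
    # Assumed meaning of h: distance between neighbouring control points.
    h = [x[i + 1] - x[i] for i in range(len(x) - 1)]
    zero_steps = [i for i, step in enumerate(h) if step == 0]
    if zero_steps:
        raise ValueError(f"coinciding control points at positions {zero_steps}")
    return [(y_rel[i + 1] - y_rel[i]) / h[i] for i in range(len(h))]


# Distinct control points work as expected.
print(secant_slopes([1.0, 3.0, 5.0], [0.0, 0.5, 1.0]))

# Two control points at the same y value would trigger the division by zero.
try:
    secant_slopes([1.0, 1.0, 5.0], [0.0, 0.5, 1.0])
except ValueError as err:
    print("problem:", err)
```

If this assumption holds, the warning points to control points that collapse onto the same value, which would be consistent with a response variable whose box plot statistics coincide.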

nickkunz commented 4 years ago

Thank you for using this Python implementation of SMOGN. I apologize for the delay. It appears that perhaps the distribution of your y response variable does not contain box plot extremes in order for the Φ function to automatically determine which range of values to over-sample.

Please consider either reducing the rel_coef argument's default value or manually specifying the range of values to over-sample and under-sample, as exhibited here: https://github.com/nickkunz/smogn/blob/master/examples/smogn_example_3_adv.ipynb
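To see why lowering rel_coef can help, here is a rough sketch that counts box plot extremes for different coefficients. It assumes rel_coef plays the role of the usual box plot whisker coefficient and it uses the standard library's quartiles, which may differ slightly from the quartiles smogn computes internally. The function name count_extremes and the sample data are made up for illustration.

```python
import statistics


def count_extremes(y, coef=1.5):
    """Return (n_low, n_high): values beyond the box plot fences."""
    q1, _, q3 = statistics.quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - coef * iqr
    high_fence = q3 + coef * iqr
    n_low = sum(1 for v in y if v < low_fence)
    n_high = sum(1 for v in y if v > high_fence)
    return n_low, n_high


# Mildly skewed sample: no extremes at coef=1.5, some at coef=0.5.
y = [10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17, 18, 20, 22]
for coef in (1.5, 1.0, 0.5):
    print(coef, count_extremes(y, coef))
```

A result of (0, 0) means the box plot finds nothing to treat as rare at that coefficient, which matches the situation described above.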

parisaazimaee commented 4 years ago

@nickkunz Hi Nick, I am encountering the same issue and was wondering how you specified the range of values to over-sample and under-sample in the example.

Bahar1978 commented 4 years ago

Hello, thanks for SMOGN. Unfortunately I have the same issue. Could you please tell us how to solve it?

devGichanLee commented 4 years ago

I checked and redefined rel_coef and rg_mtrx, but it doesn't work. I have seen many issues opened about this. Is there any plan to fix this issue? Thanks.

Bahar1978 commented 4 years ago

@devGichanLee Hi, could you please let me know how you calculated rel_coef?

jruots commented 3 years ago

I'm experiencing the same issue as well. With rel_method = 'auto' I have not, for the life of me, managed to overcome the "all points are 0" issue. What's even weirder is that with a subset of my dataset (100 rows) this has worked fine, but with the original data set (500k rows) I get this issue. I've painstakingly checked that the subset is a good representation of the original data set, but can't spot the issue.

With rel_method = 'manual' I've had better luck, but it's still not great. The array for rel_ctrl_pts_rg becomes huge with a large dataset, because you need to define a lot of values that you are interested in oversampling and a lot of values you are interested in undersampling. This then makes smogn.smoter() very slow. With the 500k rows of data I'm looking at about 36 hours to complete the operation. With the manual method it would be nice if you could still set a simple threshold, e.g. assign relevance of 1 to all values equal to or greater than 5 and relevance 0 to all values less than 5.

Of course, there is the possibility that regardless of whether rel_method is 'auto' or 'manual', that for large datasets smogn.smoter() will be very slow. It would be nice to be able to confirm this though by trying both methods for rel_method.
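The simple threshold described above might be expressible with just two control points rather than one row per value. The sketch below only builds such a matrix. It assumes each row is [y value, relevance, slope], as the matrices in this thread suggest, and that smogn interpolates relevance between control points and holds it constant outside them. That behaviour has not been verified here. The helper name threshold_ctrl_pts and the gap parameter are hypothetical.

```python
def threshold_ctrl_pts(threshold, gap):
    """Build a two-row control point matrix for a simple threshold.

    Relevance is 0 at (threshold - gap) and 1 at threshold.
    Rows are [y value, relevance, slope], in ascending order of y.
    """
    if gap <= 0:
        raise ValueError("gap must be positive so the two rows differ")
    return [
        [threshold - gap, 0, 0],  # not of interest below this value
        [threshold, 1, 0],        # fully of interest from here up
    ]


rg_mtrx = threshold_ctrl_pts(threshold=5, gap=0.5)
print(rg_mtrx)
```

The resulting list would be passed as rel_ctrl_pts_rg together with rel_method = 'manual'. Values that fall inside the gap would receive an intermediate relevance under the interpolation assumption, so the gap should be small relative to the spacing of the data.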

dptrsa-300 commented 3 years ago

+1 Same issue here. Issue #13 is also a duplicate of this.

SafetyMary commented 2 years ago

Similar issue here. "redefine phi relevance function: all points are 1"

I dug into the code and found the problem.

TL;DR: update lines 71-81 in box_plot_stats.py to make sure boxplot_stats["stats"] and boxplot_stats["xtrms"] are not empty arrays.

The root cause seems to be that smogn.box_plot_stats() does not generate a valid dictionary describing the distribution of y. A valid dictionary should contain 'stats' and 'xtrms'. The "all points are 1" error is due to an empty 'xtrms' array, and the "all points are 0" error is due to an empty 'stats' array. This in turn leads to smogn.phi_ctrl_pts() not generating a valid phi_params dictionary (the under-sample or over-sample ctrl_pts are missing), so the relevance function phi is not valid.
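The diagnosis above can be turned into a small pre-flight check. The sketch takes a dictionary shaped like the one described ('stats' and 'xtrms' keys) and reports which error to expect. It does not call smogn, the helper name diagnose_box_plot_stats is hypothetical, and the mapping from empty arrays to error messages is taken from this comment rather than verified against the library.

```python
def diagnose_box_plot_stats(box_stats):
    """List the problems expected from a box plot statistics dictionary."""
    problems = []
    # Per the comment above: empty 'stats' leads to "all points are 0".
    if len(box_stats.get("stats", [])) == 0:
        problems.append("empty 'stats': expect 'all points are 0'")
    # Per the comment above: empty 'xtrms' leads to "all points are 1".
    if len(box_stats.get("xtrms", [])) == 0:
        problems.append("empty 'xtrms': expect 'all points are 1'")
    return problems


# Example: five summary statistics, but no extreme values were found.
print(diagnose_box_plot_stats({"stats": [1, 2, 3, 4, 5], "xtrms": []}))
```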

gauraviiita commented 2 years ago

I was facing the same problem, but reducing rel_coef to 0.50 or lower helped. Thank you, Nick.

faridelya commented 2 years ago

I am facing this issue. Everything is the same as in the example for this library. Please help us out. My file is: (image)

12 KeyError Traceback (most recent call last)

C:\Anaconda\envs\datasynthetic\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

C:\Anaconda\envs\datasynthetic\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

C:\Anaconda\envs\datasynthetic\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'VOLUME'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_12836\1609191792.py in
      3
      4     data= df,
----> 5     y ="VOLUME"
      6 )

C:\Anaconda\envs\datasynthetic\lib\site-packages\smogn\smoter.py in smoter(data, y, k, pert, samp_method, under_samp, drop_na_col, drop_na_row, replace, rel_thres, rel_method, rel_xtrm_type, rel_coef, rel_ctrl_pts_rg)
    135
    136     ## determine column position for response variable y
--> 137     y_col = data.columns.get_loc(y)
    138
    139     ## move response variable y to last column

C:\Anaconda\envs\datasynthetic\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'VOLUME'
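The traceback shows data.columns.get_loc(y) raising the KeyError, which means no column is named exactly 'VOLUME'. This is a column lookup failure, not the phi relevance error. A common cause, offered here only as a guess since the file itself is not shown, is stray whitespace or different letter case in the header. The sketch below looks for a near match in a list of column names such as list(df.columns). The helper name find_column is hypothetical.

```python
import difflib


def find_column(columns, wanted):
    """Return the actual column name that best matches `wanted`, or None."""
    if wanted in columns:
        return wanted
    # Compare names with surrounding whitespace and letter case ignored.
    normalized = {str(c).strip().lower(): c for c in columns}
    key = wanted.strip().lower()
    if key in normalized:
        return normalized[key]
    # Fall back to the closest spelling, if any is reasonably similar.
    close = difflib.get_close_matches(key, list(normalized), n=1)
    return normalized[close[0]] if close else None


# A header with padding and mixed case does not equal "VOLUME" exactly.
columns = ["DATE", " Volume ", "PRICE"]
print(repr(find_column(columns, "VOLUME")))
```

Passing the returned name as y, or renaming the column first, should avoid the KeyError if a mismatch of this kind is the cause.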

Saumyadav commented 1 year ago

I am getting the error "ValueError: redefine phi relevance function: all points are 1". I changed rel_coef, but that is not working for me. Is there a specific method for finding a rel_coef value?

Saumyadav commented 1 year ago

Issue resolved by resetting the index of the DataFrame before passing it to smogn.smoter():

## conduct smogn
train_smogn = smogn.smoter(data=t_data.reset_index(drop=True), y="Labels")
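The thread does not explain why resetting the index helps. One plausible reason, stated here as an assumption, is that a DataFrame produced by filtering or splitting keeps its original row labels, so the labels no longer run from 0 to n-1. The sketch below only shows what reset_index(drop=True) changes. The data and the column name x are made up, and smogn is not called.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Labels": [0.1, 0.2, 5.0, 0.3, 7.5],
    }
)

# Filtering keeps the original row labels, leaving a gap at label 2.
t_data = df[df["x"] != 3.0]
print(list(t_data.index))

# reset_index(drop=True) relabels the rows 0..n-1 and discards the old labels.
clean = t_data.reset_index(drop=True)
print(list(clean.index))
```

The values themselves are unchanged. Only the row labels differ, which is why this step is safe to try before other changes.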

imprasukjain commented 11 months ago

Do you happen to have any updates regarding this error? How do I resolve it? I am getting the same error.

Axiid-7 commented 4 months ago

If reducing rel_coef to 0.50 does not work for you, you can declare your values of interest and non-interest in the rel_ctrl_pts_rg matrix. As in the following example, make sure that your values are in ascending order:

rel_ctrl_pts_rg = [
    [0, 0, 0],
    [1, 0, 0],
    [189, 1, 0],
    [217, 1, 0],
    [1208, 1, 0],
    [1212, 1, 0],
    [1225, 1, 0],
    [1287, 1, 0],
]
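A small check can catch ordering mistakes before the matrix is handed to smogn. The sketch below assumes each row is [y value, relevance, slope], that relevance must lie in [0, 1], and that y values must be strictly ascending (a repeated value would give a zero spacing between control points). These rules are inferred from this thread, not taken from the library's documentation, and the helper name check_ctrl_pts is hypothetical.

```python
def check_ctrl_pts(rows):
    """Return a list of problems found in a control point matrix."""
    errors = []
    for i, row in enumerate(rows):
        if len(row) != 3:
            errors.append(f"row {i}: expected 3 entries, got {len(row)}")
            continue
        if not 0 <= row[1] <= 1:
            errors.append(f"row {i}: relevance {row[1]} is outside [0, 1]")
    # The first column must increase from one row to the next.
    for i in range(1, len(rows)):
        if rows[i] and rows[i - 1] and rows[i][0] <= rows[i - 1][0]:
            errors.append(
                f"row {i}: value {rows[i][0]} is not greater than "
                f"previous value {rows[i - 1][0]}"
            )
    return errors


# The matrix from the comment above passes the check.
rel_ctrl_pts_rg = [
    [0, 0, 0], [1, 0, 0], [189, 1, 0], [217, 1, 0],
    [1208, 1, 0], [1212, 1, 0], [1225, 1, 0], [1287, 1, 0],
]
print(check_ctrl_pts(rel_ctrl_pts_rg))
```

An empty list means none of the assumed rules were violated. It does not guarantee that smogn will accept the matrix.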