nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0
319 stars 78 forks source link

Add random_seed parameter #28

Closed ArnoldYSYeung closed 2 years ago

ArnoldYSYeung commented 2 years ago

Allows user to input a random_seed for repeating runs. If None, does not use random_seed. Code has been fixed to apply to every instance of np.random and rd.random. (Seed must be defined prior to every use of np.random and rd.random.)

Code has been tested on smogn/examples/smogn_example_1_beg.ipynb . Using the same random_seed value always output the same stats.

housing_smogn = smogn.smoter(

    data = housing,  ## pandas dataframe
    y = 'SalePrice',  ## string ('header name')
    random_seed = 1
)
smogn.box_plot_stats(housing_smogn['SalePrice'])['stats']
ArnoldYSYeung commented 2 years ago

@nickkunz The modifications here have been tested with the smogn/examples/smogn_example_1_beg.ipynb notebook. Please test with more complex cases as well. Thanks.

Muti23 commented 2 years ago

@ArnoldYSYeung Thanks for your sharing. In fact I also used the same strategy to deal with my problem. I try to implant random seed for several rows in smoter/over_sampling files. Unfortunately it doesn't solve the problem completely, and only increase the reproducibility of test results. I don't know if you have the same situation? And I will use your modifications to do the same test.

polinamamo commented 2 years ago

@Muti23 my understanding after reading the code, that you need to fix seed for both np.random.seed(seed) rd.seed(seed) at the same time, as both methods are used for random sampling.

nickkunz commented 2 years ago

@ArnoldYSYeung @Muti23 @polinamamo thank you all for your input. This is very helpful for everyone using the package.

ArnoldYSYeung commented 2 years ago

@Muti23 my understanding after reading the code, that you need to fix seed for both np.random.seed(seed) rd.seed(seed) at the same time, as both methods are used for random sampling.

@polinamamo Are you suggesting we should include rd.seed(seed) wherever we have np.random.seed(seed) and vice versa?

Right now, my PR only includes one when the following lines use that specific random seed. (E.g., only rd.seed(seed) is included before rd.choices(...) and only np.random.seed(seed) is included before np.random.choice(...)) From my test case, it seems this is sufficient for reproducing the results, but please let me know if that's not the case, and I can include both together whenever using either random seed 🙏🙏

ArnoldYSYeung commented 2 years ago

@nickkunz Renamed random_seed to seed and removed pycache files. Please take a look and let me know of any other changes 🙏

ArnoldYSYeung commented 2 years ago

@nickkunz Changed base branch from master to dev

nickkunz commented 2 years ago

@ArnoldYSYeung Thank you. I really appreciate your contribution. This needed to be done.