nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0

New metrics, resolved conflicts #37

Open KulikDM opened 2 years ago

KulikDM commented 2 years ago

Not sure if this will create a new pull request, but if it does:

Deleted dist_metrics.py, since the updated metric calculations are now done internally in over_sampling.py. Also deleted the old versions of smoter.py and over_sampling.py, which are replaced by the new ones.

The new smoter.py adds two arguments for over_sampling.py: "metric" and "metric_args". These pertain to the new native distance calculations now done in over_sampling.py.

The new over_sampling.py now does all distance calculations natively, and almost in real time, compared to the long run times caused by the Python for loops in the old over_sampling.py and dist_metrics.py. The "metric" argument selects the distance metric; metrics from scipy.spatial.distance are available for numerical and categorical data. Two metrics for heterogeneous data, HEOM.py and HVDM.py, have been taken from the distython repo (https://github.com/KacperKubara/distython). "metric_args" can be passed as a dict of additional arguments for the scipy.spatial.distance metric functions.
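For anyone reviewing, here is a minimal sketch of the idea, not the actual code in over_sampling.py. It assumes the vectorized path is built on scipy.spatial.distance.cdist; the helper names pairwise_loop and pairwise_vectorized are made up for illustration. It shows how a "metric" string and a "metric_args" dict could be forwarded to scipy in place of nested Python loops.

```python
import numpy as np
from scipy.spatial.distance import cdist


def pairwise_loop(X, p=2):
    """Reference version: Minkowski distances via nested Python loops (slow)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sum(np.abs(X[i] - X[j]) ** p) ** (1.0 / p)
    return D


def pairwise_vectorized(X, metric="euclidean", metric_args=None):
    """Hypothetical helper: forward metric and metric_args to scipy's cdist."""
    metric_args = metric_args or {}
    return cdist(X, X, metric=metric, **metric_args)


# Small random data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

D_loop = pairwise_loop(X, p=3)
D_vec = pairwise_vectorized(X, metric="minkowski", metric_args={"p": 3})
```

Both matrices should agree to within floating-point error, while the cdist call avoids the per-pair Python overhead.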

Comparisons of the previous and updated runs on the included "housing.csv" data, as well as on private numerical and categorical data, show that all the different metrics are comparable to the previous smogn versions, within marginal differences (between 1e-14 and 1e-12) caused by different floating-point errors in the non-native functions.
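A rough sketch of how this kind of check can be done, on made-up random data rather than housing.csv. The helper compare_distance_matrices and the tolerance of 1e-10 are assumptions for illustration; this is not the script used for the comparison above.

```python
import numpy as np
from scipy.spatial.distance import cdist


def compare_distance_matrices(d_old, d_new, atol=1e-10):
    """Hypothetical helper: return (max absolute difference, within tolerance?)."""
    max_abs_diff = float(np.max(np.abs(d_old - d_new)))
    return max_abs_diff, bool(max_abs_diff <= atol)


# Random stand-in data; not the data used in this pull request.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# "Old" style: Euclidean distances written out by hand with NumPy.
diff = X[:, None, :] - X[None, :, :]
d_old = np.sqrt(np.sum(diff ** 2, axis=-1))

# "New" style: the same distances from scipy.spatial.distance.
d_new = cdist(X, X, metric="euclidean")

max_abs_diff, within_tol = compare_distance_matrices(d_old, d_new)
print(f"max abs difference: {max_abs_diff:.3e}, within tolerance: {within_tol}")
```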

Note that scipy.spatial.distance is now a required import in over_sampling.py, while HEOM.py and HVDM.py are included in the repository's smogn directory.
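For context on what HEOM computes, below is a minimal sketch of the textbook Heterogeneous Euclidean-Overlap Metric between two rows. It is not the code in HEOM.py, and the distython implementation may differ in details. The function name heom and its parameters are hypothetical, and it assumes categorical features are already encoded as numeric codes, missing values are NaN, and numeric ranges are non-zero.

```python
import numpy as np


def heom(x, y, cat_idx, num_range):
    """Textbook HEOM between two 1-D feature vectors (illustrative sketch).

    cat_idx   : indices of categorical features (numeric codes assumed)
    num_range : per-feature (max - min); entries at cat_idx are ignored
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num_range = np.asarray(num_range, dtype=float)

    is_cat = np.zeros(x.shape[0], dtype=bool)
    is_cat[np.asarray(cat_idx, dtype=int)] = True

    d = np.empty(x.shape[0])
    # Categorical features: overlap distance (0 if equal, 1 otherwise).
    d[is_cat] = (x[is_cat] != y[is_cat]).astype(float)
    # Numerical features: absolute difference normalized by the feature range.
    d[~is_cat] = np.abs(x[~is_cat] - y[~is_cat]) / num_range[~is_cat]
    # A missing value in either vector counts as the maximum distance of 1.
    d[np.isnan(x) | np.isnan(y)] = 1.0

    return float(np.sqrt(np.sum(d ** 2)))


# Feature 0 is numeric with range 4; features 1 and 2 are categorical codes.
x = [1.0, 0.0, 2.0]
y = [3.0, 1.0, 2.0]
dist = heom(x, y, cat_idx=[1, 2], num_range=[4.0, 1.0, 1.0])
```

Here the per-feature distances are 0.5, 1 and 0, so dist is the square root of 1.25.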