scienxlab / redflag

Safety net for machine learning pipelines. Plays nice with sklearn and pandas.
https://scienxlab.org/redflag
Apache License 2.0

Recommend which transform will produce the most Gaussian distribution #46

Open kwinkunks opened 1 year ago

kwinkunks commented 1 year ago

Could we look at features and targets and recommend suitable nonlinear transformations to make them more amenable to learning?

I think scipy.stats.boxcox() should work:

from scipy import stats
xt, lmbda = stats.boxcox(x)

xt is the transformed data and lmbda is the value of the Box-Cox lambda parameter that maximizes the log-likelihood function. A lambda close to 1 means the best power transform is essentially the identity, so the data need little or no transformation. If it's 2, you should square the data; if it's 0.5, take the square root; if it's 0, take the log; etcetera.
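For illustration, the lambda-to-transform mapping described above could be sketched roughly like this. The helper `suggest_power_transform` and its table of candidate exponents are hypothetical (not part of Redflag or SciPy); the only library call relied on is `scipy.stats.boxcox`.

```python
import numpy as np
from scipy import stats

# Hypothetical lookup: 'round' lambda values and the familiar transform each implies.
CANDIDATES = {
    -1.0: "reciprocal",
    -0.5: "reciprocal square root",
    0.0: "log",
    0.5: "square root",
    1.0: "none",
    2.0: "square",
}

def suggest_power_transform(x):
    """Fit Box-Cox by maximum likelihood and name the nearest simple transform."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data.")
    _, lmbda = stats.boxcox(x)  # returns (transformed data, fitted lambda)
    nearest = min(CANDIDATES, key=lambda k: abs(k - lmbda))
    return lmbda, CANDIDATES[nearest]

# Synthetic log-normal data, for which a log transform is the natural choice.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)
lmbda, name = suggest_power_transform(x)
print(f"lambda = {lmbda:.3f}, suggested transform: {name}")
```

Snapping to the nearest round exponent is just one design choice; applying the fitted lambda directly is equally reasonable and loses nothing but interpretability.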

UPDATE

Box-Cox only works on strictly positive data. Turns out there's Yeo-Johnson, which is similar but also handles zero and negative values.
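A small sketch of the difference, assuming synthetic skewed data that has been shifted so it includes negative values. `scipy.stats.boxcox` and `scipy.stats.yeojohnson` are real SciPy functions; the data and variable names are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Right-skewed data shifted so that roughly half of the values are negative.
x = rng.lognormal(size=2000) - 1.0

# Box-Cox refuses data that is not strictly positive.
try:
    stats.boxcox(x)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False

# Yeo-Johnson accepts the same data and fits its own lambda.
xt, lmbda = stats.yeojohnson(x)

skew_before = stats.skew(x)
skew_after = stats.skew(xt)
print(f"Box-Cox usable: {boxcox_ok}")
print(f"Yeo-Johnson lambda = {lmbda:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```

Skewness is only a rough proxy for "more Gaussian" here; a proper normality measure would be a better basis for any recommendation.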

Question: should this be done before standardizing the data? Probably, but I'm not sure.

Turns out both Box-Cox and Yeo-Johnson are in sklearn too:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
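One possible shape for the recommendation itself, as a rough sketch: fit each applicable `PowerTransformer` method and keep whichever scores highest on a normality measure. The function `most_gaussian_transform` and the choice of the Shapiro-Wilk W statistic as the score are assumptions for illustration, not an existing Redflag API. Note that `PowerTransformer` defaults to `standardize=True`, so it scales to zero mean and unit variance after the power transform, which may partly answer the ordering question above.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

def most_gaussian_transform(x):
    """Return the best-scoring method name and all Shapiro-Wilk W scores.

    Hypothetical helper: W closer to 1 is taken to mean 'more Gaussian'.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)  # sklearn expects 2D input
    methods = ["yeo-johnson"]
    if np.all(x > 0):
        methods.append("box-cox")  # only valid for strictly positive data

    w_raw, _ = stats.shapiro(x.ravel())
    scores = {"none": w_raw}
    for method in methods:
        pt = PowerTransformer(method=method, standardize=True)
        xt = pt.fit_transform(x)
        w, _ = stats.shapiro(xt.ravel())
        scores[method] = w

    best = max(scores, key=scores.get)
    return best, scores

rng = np.random.default_rng(1)
x = rng.lognormal(size=1000)
best, scores = most_gaussian_transform(x)
print(best, {k: round(float(v), 4) for k, v in scores.items()})
```

For a multi-column feature matrix the same idea could be applied per column, since `PowerTransformer` fits one lambda per feature and exposes them as `lambdas_`.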

kwinkunks commented 1 year ago

:bulb: Will need to consider that Redflag's stdev-based outlier detection won't work on features that need transformation... should apply transformation before deciding on outliers. Probably needs an issue.
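To show why the order matters, here is a sketch using a plain z-score rule as a stand-in for stdev-based outlier detection. `zscore_outliers` is hypothetical and is not Redflag's actual detector; the data are synthetic.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
x = rng.lognormal(size=2000)  # skewed, but contains no injected outliers

# On the raw data, the long right tail is flagged as outliers.
raw_flags = zscore_outliers(x)

# After a power transform, the same points mostly fall within 3 standard deviations.
pt = PowerTransformer(method="yeo-johnson")
xt = pt.fit_transform(x.reshape(-1, 1)).ravel()
transformed_flags = zscore_outliers(xt)

print(f"flagged before transform: {raw_flags.sum()}")
print(f"flagged after transform:  {transformed_flags.sum()}")
```

The points flagged on the raw data are just the ordinary tail of a skewed distribution, which is the reason to apply the transformation before deciding on outliers.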