Recommend which transform will produce the most Gaussian distribution

Could we look at features and targets and recommend suitable nonlinear transformations to make them more amenable to learning?

I think this should work:

scipy.stats.boxcox().

from scipy import stats

# x: 1-D array of strictly positive values
xt, lmbda = stats.boxcox(x)

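xt is the transformed data and lmbda is the value of the lambda parameter that maximizes the log-likelihood function. The closer lmbda is to 1, the less the data needs transforming: lambda = 1 only shifts the data, so the distribution is already about as normal as a power transform can make it. If it's 2, you should square the data; if it's 0.5, take the square root; if it's 0, take the log; etcetera.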
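Here's a minimal sketch of how that could look in practice. The data is synthetic (a lognormal sample I made up for illustration), and comparing skewness before and after is just one rough way to check whether the transform helped, not a definitive test.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (an assumption for illustration).
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2000)

# With no lmbda given, boxcox() returns the transformed data and the
# lambda that maximizes the log-likelihood.
xt, lmbda = stats.boxcox(x)

# Skewness near 0 is one rough indicator of a more symmetric distribution.
skew_before = stats.skew(x)
skew_after = stats.skew(xt)

print(f"lambda = {lmbda:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```

For lognormal data the fitted lambda should land near 0, which corresponds to a log transform.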

UPDATE

Box-Cox only works on strictly positive data. Turns out there's Yeo-Johnson, which is similar but handles zero and negative values too.
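A quick sketch of the difference, assuming scipy.stats.yeojohnson() (which follows the same call pattern as boxcox()). The shifted exponential sample is made up so that it contains negative values.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data that includes negative values
# (an assumption for illustration).
rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=2000) - 0.5

# Box-Cox rejects data that is not strictly positive.
try:
    stats.boxcox(x)
    boxcox_failed = False
except ValueError:
    boxcox_failed = True

# Yeo-Johnson accepts any real values.
xt, lmbda = stats.yeojohnson(x)

print(f"Box-Cox failed: {boxcox_failed}")
print(f"Yeo-Johnson lambda = {lmbda:.3f}")
print(f"skewness before = {stats.skew(x):.3f}, after = {stats.skew(xt):.3f}")
```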

Question: should this be done before standardizing the data? Not sure.

Turns out both Box-Cox and Yeo-Johnson are in sklearn too:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
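A rough sketch of how a recommendation could work with PowerTransformer: try each method, score the result, and pick the most Gaussian one. The helper normality_score() is a hypothetical name, and using the Shapiro-Wilk W statistic as the score is my assumption; other normality measures would do. Note that PowerTransformer standardizes its output by default (standardize=True), so the power transform is applied first and the scaling to zero mean and unit variance comes after.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive feature as a single column
# (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=0.75, size=(1000, 1))

# Candidate versions of the feature, including the untransformed data.
candidates = {"none": X}
for method in ("box-cox", "yeo-johnson"):
    pt = PowerTransformer(method=method, standardize=True)
    candidates[method] = pt.fit_transform(X)


def normality_score(a):
    """Hypothetical helper: Shapiro-Wilk W statistic (closer to 1 is more Gaussian)."""
    w, _ = stats.shapiro(a.ravel())
    return w


scores = {name: normality_score(a) for name, a in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:12s} W = {score:.4f}")
print(f"recommended: {best}")
```

Box-Cox would need to be skipped for any feature with zero or negative values, since it raises an error on such data.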
