from scipy import stats

# x is your 1-D array of (strictly positive) data
xt, lmbda = stats.boxcox(x)
`xt` is the transformed data and `lmbda` is the fitted lambda parameter: the value that maximizes the log-likelihood function. Lambda tells you which power transform was applied to produce `xt`: lambda = 2 corresponds to squaring the data, lambda = 0.5 to taking the square root, lambda = 0 to taking the log, and so on. The closer lambda is to 1, the closer the data already was to normal, since lambda = 1 is (up to a shift) no transformation at all.
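A quick sketch of this in action (hypothetical data, assuming only scipy and numpy): on strongly right-skewed lognormal data, the fitted lambda should come out near 0, i.e. Box-Cox essentially picks a log transform, and the transformed data ends up much less skewed.

```python
import numpy as np
from scipy import stats

# Hypothetical example: right-skewed (lognormal) data.
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

xt, lmbda = stats.boxcox(x)

# For lognormal data the fitted lambda should be near 0
# (roughly a log transform), and the skewness of xt near 0.
print(lmbda)
print(stats.skew(xt))
```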
UPDATE
Box-Cox only works on positive-valued data. Turns out there's Yeo-Johnson, which is similar but also handles zeros and negative values.
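For example (a minimal sketch with made-up data): shift a skewed sample so it contains negative values, which Box-Cox would reject with a `ValueError`, and `scipy.stats.yeojohnson` handles it the same way `boxcox` handles positive data.

```python
import numpy as np
from scipy import stats

# Hypothetical example: skewed data shifted so some values are negative.
rng = np.random.default_rng(0)
x = rng.lognormal(size=1000) - 2.0

# stats.boxcox(x) would raise ValueError here (non-positive values),
# but Yeo-Johnson is fine with them.
xt, lmbda = stats.yeojohnson(x)
```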
Question: this should probably be done before standardizing the data? Not sure.
Turns out both Box-Cox and Yeo-Johnson are in sklearn too: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
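A minimal sketch of the sklearn version: `PowerTransformer` standardizes the output (zero mean, unit variance) after transforming by default, which hints at an answer to the ordering question above, i.e. transform first, then standardize.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed feature, shape (n_samples, n_features).
rng = np.random.default_rng(1)
X = rng.lognormal(size=(1000, 1))

# standardize=True (the default) rescales after the power transform.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
Xt = pt.fit_transform(X)

print(Xt.mean(), Xt.std())
```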
:bulb: Will need to consider that Redflag's stdev-based outlier detection won't work on features that need transformation... should apply transformation before deciding on outliers. Probably needs an issue.
Could we look at features and targets and recommend suitable nonlinear transformations to make them more amenable to learning?
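A rough sketch of what such a recommender might look like, assuming nothing beyond scipy. The heuristic and thresholds here are entirely made up for illustration: reject features that already pass a normality test, otherwise fit Yeo-Johnson and recommend it only if it actually reduces skewness.

```python
import numpy as np
from scipy import stats

def suggest_transform(x, alpha=0.05):
    """Hypothetical heuristic, not part of any library.

    If the feature fails a normality test and a Yeo-Johnson fit
    reduces its skewness, recommend that transform (with its fitted
    lambda); otherwise recommend nothing. Thresholds are arbitrary.
    """
    x = np.asarray(x, dtype=float)
    _, p = stats.normaltest(x)
    if p >= alpha:
        return None  # already plausibly normal
    xt, lmbda = stats.yeojohnson(x)
    if abs(stats.skew(xt)) < abs(stats.skew(x)):
        return ("yeo-johnson", lmbda)
    return None

rng = np.random.default_rng(7)
skewed = rng.lognormal(size=1000)
print(suggest_transform(skewed))
```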