gregversteeg / gaussianize

Transforms univariate data into normally distributed data
MIT License

Random projections to gaussianize #2

Open ghost opened 7 years ago

ghost commented 7 years ago

Just pseudorandomly sign flip the data (and/or randomly permute it) and run it through an O(n log n) Walsh-Hadamard transform, and the data takes on a Gaussian distribution. If the data is very sparse, you might have to apply the idea twice to get to a Gaussian. https://randomprojectionai.blogspot.com/
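A minimal sketch of this pipeline, assuming a power-of-two length; the `fwht` helper and the fixed seed are illustrative, not part of this repository or any particular library:

```python
import numpy as np

def fwht(x):
    """O(n log n) fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling keeps the vector length unchanged

rng = np.random.default_rng(0)            # recomputable pseudorandom seed
data = rng.uniform(size=1024)             # e.g. uniform, clearly non-Gaussian
signs = rng.choice([-1.0, 1.0], size=data.shape)
transformed = fwht(signs * data)          # approximately Gaussian by the CLT
```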

gregversteeg commented 7 years ago

Thanks, the links talk a lot about generating Gaussian random numbers, but I don't see how we would use this to transform data drawn from one distribution into a Gaussian. In particular, one thing we really want is an invertible function, which can also be applied to previously unseen data. But if the WH transform can be used in this way, I'd love to learn more.

ghost commented 7 years ago

The process is to pseudorandomly (recomputably) sign flip the numeric data, then take the fast Walsh-Hadamard transform (O(n log n)) of the result. Sign flipping is obviously invertible, and the usual orderings of the WHT are self-inverse. Both operations leave the vector length unchanged, so the overall process is invertible. If you look at the matrix equivalent of the WHT, what you get is just (orthogonal) patterns of addition and subtraction of the data being transformed. The central limit theorem applies not only to the sum of n random variables; it also holds when subtraction replaces some of the additions. The sign flipping turns the input data into essentially random variables, which then, by the central limit theorem, take on a Gaussian distribution once subjected to the WHT. To get a good approximation of a Gaussian in all cases you may need to repeat the process 2 or 3 times, but for most natural data once is enough.
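The invertibility claim can be checked directly with the sketch above (reusing the hypothetical `fwht`, `signs`, `transformed`, and `data` names): the orthonormally scaled WHT is its own inverse, and multiplying by the same signs undoes the flip.

```python
# Round trip: apply the self-inverse WHT again, then undo the sign flip.
recovered = signs * fwht(transformed)
assert np.allclose(recovered, data)
```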

gregversteeg commented 7 years ago

Thanks, this makes sense, but I think it differs in intent from the purpose of Gaussianizing data as expressed in the papers and methods linked here. Your method takes data and produces random variables drawn from a Gaussian distribution, but it doesn't preserve the order of the data. For example, suppose I have samples of data like this: [0.1, 0.2, 0.3, 0.4, ..., 1.0]. Empirically, the distribution looks uniform, but I would like one that looks Gaussian while preserving the order of the data. So a good Gaussianizer (for this purpose) will return something like [-0.63, -0.32, -0.2, -0.1, 0, 0.1, 0.2, 0.32, 0.63] (numbers made up, but meant to look Gaussian distributed). Whereas the Hadamard transform with random sign flips might give me the right distribution, but scrambled, like [0.1, 0.63, -0.32, 0, ...].
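For contrast, an order-preserving Gaussianization of that toy sample could be done with a rank-based transform; this is just one common choice for illustration, not necessarily the method this repository implements:

```python
import numpy as np
from scipy.stats import norm, rankdata

x = np.arange(0.1, 1.01, 0.1)        # [0.1, 0.2, ..., 1.0], roughly uniform
u = (rankdata(x) - 0.5) / len(x)     # empirical CDF values in (0, 1)
z = norm.ppf(u)                      # map through the inverse normal CDF
# z looks Gaussian and is monotone in x, so the order of the samples is preserved
```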

ghost commented 7 years ago

That sounds a lot more difficult, especially if you want to maintain accuracy during inversion.
