raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0

pingouin.multivariate_normality() out of memory due to large array #121

Closed jckrumm closed 2 months ago

jckrumm commented 4 years ago

When I give a large array of data to multivariate_normality(), I get an error about insufficient memory. Specifically, for an input with n sample vectors, multivariate_normality() seems to be trying to allocate an n x n array, so when n = 1,000,000 there is not enough memory. Below is some simple Python code that reproduces the error:

import numpy as np
import pingouin as pg

# Big array of 2D vectors
array = np.random.rand(1000000, 2)
print(array.shape)

# Test for multivariate normality
pg.multivariate_normality(array)

The error message says, "Exception has occurred: MemoryError Unable to allocate 7.28 TiB for an array with shape (1000000, 1000000) and data type float64".

When I look at the equation for computing the test statistic at https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Multivariate_normality_tests, it doesn't appear that it's necessary to allocate a big array like this.
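
For reference, here is my transcription of the BHEP / Henze-Zirkler statistic from that Wikipedia section (please double-check it against the page); \hat{Y}_j are the standardized samples and k is the dimension:

T_\beta = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{-\frac{\beta^2}{2} (\hat{Y}_i - \hat{Y}_j)^\top (\hat{Y}_i - \hat{Y}_j)}
        - \frac{2}{(1 + \beta^2)^{k/2}} \sum_{i=1}^{n} e^{-\frac{\beta^2}{2(1 + \beta^2)} \hat{Y}_i^\top \hat{Y}_i}
        + \frac{n}{(1 + 2\beta^2)^{k/2}}

The first term is a double sum over all pairs of points, which is where an n x n intermediate can appear.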

I'm attaching screen shots of the error message and the equation for the test statistic from Wikipedia.

[Screenshots: error message, test statistic]

raphaelvallat commented 4 years ago

Thanks for opening the issue @jckrumm! For this function, I essentially translated pre-existing R and Matlab implementations into Python (see here and here), which I believe are already quite optimized (e.g. no for loops), at least when working with "regular"-sized arrays. And to be honest, I don't think I'll be able to come up with a more optimized implementation based only on the original equations.

Therefore, in your case, I think one of the best solutions is to downcast the dtype of the original array from float64 to float32 or float16 to reduce memory usage. For instance, although it took some time, I was able to run the following lines on my system (MacBook, 16 GB RAM):

import numpy as np

np.random.seed(123)
X = np.random.rand(1000000, 2).astype(np.float16)
n, p = X.shape

# Covariance matrix and its pseudo-inverse (cast back to the input dtype)
S = np.cov(X, rowvar=False, bias=True)
S_inv = np.linalg.pinv(S).astype(X.dtype)
# Deviations from the column means
difT = X - X.mean(0)

# Squared-Mahalanobis distances
# The lines below are the ones causing the MemoryError in the original function
Dj = np.diag(np.linalg.multi_dot([difT, S_inv, difT.T]))
Y = np.linalg.multi_dot([X, S_inv, X.T])
Djk = -2 * Y.T + np.repeat(np.diag(Y.T), n).reshape(n, -1) + \
      np.tile(np.diag(Y.T), (n, 1))

Can you let me know if it works for you?

If it doesn't, maybe there are some issues with memory overcommitment?

Thanks, Raphael

jckrumm commented 4 years ago

Hi Raphael,

Thanks for your thoughtful reply on this. I appreciate the help. Unfortunately, my Python tries to allocate a float64 array even if the input to multivariate_normality is a float16 array, so even with your perfectly reasonable suggestion, my PC is trying to allocate the same 7.28 TB array. I've attached screenshots of the code before and after the .astype(np.float16) modification. Are you suggesting I should instead copy your code and modify it to enforce the input type in the intermediate computations?

My suspicion is that your code is CPU efficient (with no “for” loops), but not necessarily memory efficient. I’m guessing it’s allocating the big array to compute the term that I have circled in the attached BHEP image, a screenshot from https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Multivariate_normality_tests. That term is basically the sum of squared distances between all the input points (disregarding the covariance scaling and the exponentiation, it’s just a big double summation over all possible pairs of points). If you were writing this in C without any extra linear algebra package, you’d just do it with nested “for” loops and an accumulating sum: you wouldn’t precompute all the distances between all the points, but instead compute them on the fly within the inner loop and sum them. You have clever code for computing this big sum, but under the covers I’m guessing it’s allocating a huge array that isn’t really necessary.
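
Something like the block-wise accumulation below is what I have in mind (just a rough, unbenchmarked sketch, not anything from Pingouin; the function name and chunk size are made up). Peak memory stays at chunk x n instead of n x n:

import numpy as np

def pairwise_exp_sum(X, S_inv, bsq, chunk=2000):
    # Accumulate sum_{i,j} exp(-bsq/2 * D_ij), where D_ij is the squared
    # Mahalanobis distance between points i and j, one block of rows at a
    # time, so only a (chunk x n) slice ever exists in memory.
    # Assumes S_inv is symmetric (it is, being the pseudo-inverse of a
    # covariance matrix).
    n = X.shape[0]
    Z = X @ S_inv                      # shape (n, p)
    q = np.einsum('ij,ij->i', Z, X)    # q[i] = x_i^T S_inv x_i
    total = 0.0
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # (chunk, n) block of squared Mahalanobis distances
        D_block = q[start:stop, None] + q[None, :] - 2.0 * (Z[start:stop] @ X.T)
        total += np.exp(-0.5 * bsq * D_block).sum()
    return total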

John

[Screenshots: before, after, BHEP test statistic]

raphaelvallat commented 4 years ago

Hi @jckrumm,

Thanks for your detailed response!

Are you suggesting I should instead copy your code and modify to enforce the input type in the intermediate computations?

Absolutely. The current implementation in Pingouin uses float64 regardless of the input. That's why I've extracted (and updated) a few relevant lines from the function for you to try.

If you were writing this in C without any extra linear algebra package, you’d just do it with nested “for” loops and an accumulating sum. You wouldn’t necessarily precompute all the distances between all the points, but instead compute them on the fly within the inner loop and then sum.

I see your point. I think that a pure Python "for loop" implementation would end up being extremely slow, though, so one would probably have to use some Numba optimization. Alternatively, do you know of other multivariate normality tests that do not require such computations?
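
For example, a Numba-compiled version of the nested-loop idea might look something like this (an untested sketch, not part of Pingouin; pairwise_exp_sum_numba is a made-up name):

import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def pairwise_exp_sum_numba(X, S_inv, bsq):
    # Accumulate sum_{i,j} exp(-bsq/2 * D_ij) on the fly, without ever
    # building the n x n matrix of squared Mahalanobis distances.
    n, p = X.shape
    total = 0.0
    for i in prange(n):
        acc = 0.0
        for j in range(n):
            # D_ij = (x_i - x_j)^T S_inv (x_i - x_j)
            d = 0.0
            for a in range(p):
                for b in range(p):
                    d += (X[i, a] - X[j, a]) * S_inv[a, b] * (X[i, b] - X[j, b])
            acc += np.exp(-0.5 * bsq * d)
        total += acc
    return total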

One last suggestion: what about under-sampling your array? Given the huge amount of data, I'm guessing that taking only, say, one-fourth or even one-tenth of the data at random should not compromise the final results too much(?).
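
For example, something along these lines (assuming `array` from your first post; the 1/10 fraction is arbitrary):

import numpy as np
import pingouin as pg

rng = np.random.default_rng(42)
# Random subsample of one-tenth of the rows, without replacement
idx = rng.choice(len(array), size=len(array) // 10, replace=False)
print(pg.multivariate_normality(array[idx]))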

Thanks, Raphael

raphaelvallat commented 4 years ago

I'm closing this issue for now, but please feel free to reopen. Thanks

avm19 commented 2 years ago

I have encountered the same problem for a much smaller input, of shape (19000, 7). I don't know what size of array the code was trying to allocate.

I see in the source code that very large arrays are currently not tested for, with input of shape (1000, 100) being the largest.

[Screenshots attached]

raphaelvallat commented 2 years ago

Thanks @avm19! OK, I had a closer look at the source code, and I think we can get better performance (and similar results) by replacing

https://github.com/raphaelvallat/pingouin/blob/adf07187bae972a14a47b3c7c85826d4b7a7b67c/pingouin/multivariate.py#L81

with

# Broadcasting against the diagonal avoids the extra intermediate arrays
# created by np.repeat / np.tile in the current version
Y_diag = np.diag(Y)
Djk = -2 * Y.T + Y_diag + Y_diag[..., None]
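
As a quick sanity check (not from the actual PR), the two forms agree on a small random example:

import numpy as np

rng = np.random.default_rng(0)
n = 5
Y = rng.random((n, n))

# Current repeat/tile version
old = -2 * Y.T + np.repeat(np.diag(Y.T), n).reshape(n, -1) + \
      np.tile(np.diag(Y.T), (n, 1))

# Broadcasting version
Y_diag = np.diag(Y)
new = -2 * Y.T + Y_diag + Y_diag[..., None]

assert np.allclose(old, new)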

I'll work on a PR in the coming days/weeks.