Open asdf8601 opened 7 years ago
Thanks for the detailed example. This seems very closely related to #3513 (possibly a duplicate).
FWIW, R seems to take a similar approach:
np.random.seed(42)
<...>
In [58]: M
Out[58]:
array([[        nan, -0.1382643 ],
       [ 0.64768854,         nan],
       [-0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473],
       [-0.46947439,  0.54256004],
       [-0.46341769, -0.46572975],
       [ 0.24196227, -1.91328024],
       [-1.72491783, -0.56228753],
       [-1.01283112,  0.31424733],
       [-0.90802408, -1.4123037 ]])
In [59]: cov_pd
Out[59]: 0.22724929787316234
R
a = c(NA, 0.64768854, -0.23415337, 1.57921282, -0.46947439,
-0.46341769, 0.24196227, -1.72491783, -1.01283112, -0.90802408)
b = c(-0.1382643 , NA, -0.23413696, 0.76743473, 0.54256004, -0.46572975,
-1.91328024, -0.56228753, 0.31424733, -1.4123037 )
cov(a, b, use='pairwise')
# [1] 0.2272493
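As a cross-check from the Python side, here is a minimal sketch that feeds the same rounded values used in the R snippet to `Series.cov`. The names `a`, `b` and `cov_ab` simply mirror the R vectors and are not from the original (elided) pandas session. Because the inputs are rounded to 8 decimals, only the leading digits are expected to agree with `cov_pd`.

```python
import numpy as np
import pandas as pd

# Same rounded values as the R vectors a and b above (NA -> np.nan).
a = pd.Series([np.nan, 0.64768854, -0.23415337, 1.57921282, -0.46947439,
               -0.46341769, 0.24196227, -1.72491783, -1.01283112, -0.90802408])
b = pd.Series([-0.1382643, np.nan, -0.23413696, 0.76743473, 0.54256004,
               -0.46572975, -1.91328024, -0.56228753, 0.31424733, -1.4123037])

# Series.cov ignores every position where either value is missing and
# computes the sample covariance (ddof=1) on the remaining 8 pairs.
cov_ab = a.cov(b)
print(round(cov_ab, 7))
```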
Since this has a concrete example, I will close #3513.
I think there are two different (but closely related) issues in here:
import pandas as pd
A = pd.DataFrame([[1, 2],[None, 4],[5, None],[7, 8]])
cov_pd = A.cov()
0 1
0 9.333333 18.000000
1 18.000000 9.333333
Note that cov_pd is not positive semi-definite:
import numpy as np
np.linalg.eigvals(cov_pd)
array([ 27.33333333, -8.66666667])
For this very same example, NumPy does this:
# Mask the NaN entries, then treat columns as variables (rowvar=False).
masked_A = np.ma.MaskedArray(A.values, mask=np.isnan(A.values))
cov_np = np.ma.cov(masked_A, rowvar=False).data
array([[ 9.33333333, 17.77777778],
[ 17.77777778, 9.33333333]])
which, again, is not positive semi-definite:
np.linalg.eigvals(cov_np)
array([ 27.11111111, -8.44444444])
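For comparison, listwise deletion (dropping every row that contains a missing value before computing the covariance) always works from a single complete data set, so the result is positive semi-definite up to rounding. The sketch below checks this on the same `A`; `is_psd` is a hypothetical helper written for this illustration, not a pandas or NumPy function.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 2], [None, 4], [5, None], [7, 8]])

def is_psd(matrix, tol=1e-10):
    """Hypothetical helper: True if every eigenvalue of a symmetric matrix is >= -tol."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(matrix, dtype=float))
    return bool(eigenvalues.min() >= -tol)

cov_pairwise = A.cov()           # pandas default: pairwise-complete observations
cov_listwise = A.dropna().cov()  # listwise deletion: only rows (1, 2) and (7, 8) remain

print(is_psd(cov_pairwise))  # False
print(is_psd(cov_listwise))  # True
```

The trade-off is that listwise deletion discards the two partially observed rows, so the variances on the diagonal are estimated from fewer points than in the pairwise result.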
BTW, the Matlab function cov handles three cases via the nanflag argument:
nanflag = 'includenan'
nanflag = 'omitrows'
nanflag = 'partialrows'
and it does, by far, the best job of documenting the difference. With 'partialrows', the calculation matches pandas':
nanflag = 'partialrows'
cov_m = cov(A, 0, nanflag)
cov_m =
9.3333 18.0000
18.0000 9.3333
It is not clear to me at this moment which implementation is more reasonable, since NumPy's may be more precise.
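The two off-diagonal values (18.0 versus 17.78) can be reproduced by hand under one assumption about where the means come from. This is a sketch that is consistent with the outputs shown above, not a statement about either library's source code. In both variants only the rows where both columns are observed contribute products, and the divisor is the number of such rows minus one. The variants differ in whether the means are recomputed on those rows or taken over all observed values of each column. All variable names here are illustrative.

```python
import numpy as np

A = np.array([[1, 2], [np.nan, 4], [5, np.nan], [7, 8]], dtype=float)
x, y = A[:, 0], A[:, 1]

both = ~np.isnan(x) & ~np.isnan(y)  # rows where both columns are observed
n_pairs = int(both.sum())           # 2 rows: (1, 2) and (7, 8)

# Variant 1: means recomputed on the pairwise-complete rows only.
dx = x[both] - x[both].mean()
dy = y[both] - y[both].mean()
cov_pair_means = (dx * dy).sum() / (n_pairs - 1)

# Variant 2: each column centered on the mean of all its observed values.
dx_all = x[both] - np.nanmean(x)
dy_all = y[both] - np.nanmean(y)
cov_column_means = (dx_all * dy_all).sum() / (n_pairs - 1)

print(cov_pair_means)    # 18.0, the pandas / Matlab 'partialrows' value
print(cov_column_means)  # about 17.7778, the np.ma.cov value
```

Variant 2 uses more data for the means, but neither variant guarantees a positive semi-definite matrix, as the eigenvalues above show.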
Thanks - not something I'm deeply knowledgeable about, but at minimum we would definitely take some expanded docs warning that cov in the presence of missing data should be interpreted carefully.
Code Sample, a copy-pastable example if possible
Problem description
I am trying to calculate the covariance matrix in the presence of missing values, and I've noticed that numpy and pandas return different matrices, and that the difference grows as the proportion of missing values increases. I've included a snippet of both implementations above. For me the numpy approach is more useful; it seems to be more robust in the presence of missing values.