pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

numpy vs pandas: different estimation of covariance in presence of nan values #16837

Open asdf8601 opened 7 years ago

asdf8601 commented 7 years ago

Code Sample, a copy-pastable example if possible

# =============================================================================
# NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES
# =============================================================================
import numpy as np
import pandas as pd

# data with NaN values
M = np.random.randn(10, 2)
# add missing values
M[0,0] = np.nan
M[1,1] = np.nan

# Covariance matrix calculations
# ==============================
# numpy
# -----
masked_arr = np.ma.array(M, mask=np.isnan(M))
cov_numpy = np.ma.cov(masked_arr, rowvar=0, allow_masked=True, ddof=1).data

# pandas
# ------
cov_pandas = pd.DataFrame(M).cov(min_periods=0).values

# Homemade covariance coefficient calculation
# (what each of them is actually doing)
# =============================================
# select elements to estimate the covariance matrix element (0,1)
x = M[:,0]
y = M[:,1]

mask_x = ~np.isnan(x)
mask_y = ~np.isnan(y)
mask_common = mask_x & mask_y

# numpy
# -----
xn = x-np.mean(x[mask_x])
yn = y-np.mean(y[mask_y])
cov_np = sum(a*b for a,b,c in zip(xn,yn, mask_common) if c)/(np.sum(mask_common)-1)

# pandas
# ------
xn = x-np.mean(x[mask_common])
yn = y-np.mean(y[mask_common])
cov_pd = sum(a*b for a,b,c in zip(xn,yn, mask_common) if c)/(np.sum(mask_common)-1)
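As a quick check (assuming the snippet above has been run), each homemade coefficient reproduces the corresponding (0, 1) entry of the matrix returned by its library:

# the homemade values match each library's (0, 1) covariance entry
print(np.isclose(cov_np, cov_numpy[0, 1]))   # True: matches np.ma.cov
print(np.isclose(cov_pd, cov_pandas[0, 1]))  # True: matches DataFrame.cov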

Problem description

I am trying to calculate the covariance matrix in the presence of missing values, and I've noticed that numpy and pandas return different matrices, with the difference growing as the number of missing values increases. I've included a snippet of both implementations above. To me the numpy way is more useful; it seems to be more robust in the presence of missing values.

chris-b1 commented 7 years ago

Thanks for the detailed example, seems very closely related to #3513 (possibly a duplicate)

chris-b1 commented 7 years ago

FWIW, R seems to take a similar approach

np.random.seed(42)
<...>
In [58]: M
Out[58]: 
array([[        nan, -0.1382643 ],
       [ 0.64768854,         nan],
       [-0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473],
       [-0.46947439,  0.54256004],
       [-0.46341769, -0.46572975],
       [ 0.24196227, -1.91328024],
       [-1.72491783, -0.56228753],
       [-1.01283112,  0.31424733],
       [-0.90802408, -1.4123037 ]])

In [59]: cov_pd
Out[59]: 0.22724929787316234

R

a = c(NA,  0.64768854, -0.23415337,  1.57921282, -0.46947439, 
      -0.46341769,  0.24196227, -1.72491783, -1.01283112, -0.90802408)
b = c(-0.1382643 ,   NA, -0.23413696,  0.76743473,  0.54256004,  -0.46572975,
      -1.91328024, -0.56228753,  0.31424733, -1.4123037 )
cov(a, b, use='pairwise')

# [1] 0.2272493

https://stats.stackexchange.com/q/110719
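For anyone who wants to reproduce the pandas side of this comparison, a minimal sketch (it assumes the same seed and missing entries as above):

import numpy as np
import pandas as pd

np.random.seed(42)
M = np.random.randn(10, 2)
M[0, 0] = np.nan  # same missing entries as the array shown above
M[1, 1] = np.nan

# pandas deletes pairwise-incomplete rows, like R's use='pairwise'
print(pd.DataFrame(M).cov().iloc[0, 1])  # 0.22724929787316234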

jreback commented 7 years ago

since this has a concrete example, will close #3513

masdeseiscaracteres commented 7 years ago

I think there are two different (but closely related) issues in here:

  1. The one referred to by #3513. That is, the functions in both pandas and numpy use the so-called "pairwise deletion" strategy, which can yield covariance matrices that are not positive semi-definite. An even simpler example than the one above demonstrates the problem:
import pandas as pd
A = pd.DataFrame([[1, 2],[None, 4],[5, None],[7, 8]])
cov_pd = A.cov()

           0          1
0   9.333333  18.000000
1  18.000000   9.333333

Note that cov_pd is not positive semi-definite:

import numpy as np
np.linalg.eigvals(cov_pd)
    array([ 27.33333333,  -8.66666667])

For this very same example Numpy does this:

masked_A = np.ma.MaskedArray(A.values, np.isnan(A))
cov_np = np.ma.cov(masked_A, rowvar=0).data
    array([[  9.33333333,  17.77777778],
           [ 17.77777778,   9.33333333]])

which, again, is not positive semi-definite:

np.linalg.eigvals(cov_np)
    array([ 27.11111111,  -8.44444444])

BTW, the Matlab function cov handles three cases via the nanflag argument ('includenan', 'omitrows' and 'partialrows') and does, by far, the best job of documenting the difference. Its 'partialrows' calculation matches pandas':

nanflag = 'partialrows'
cov_m = cov(A, 0, nanflag)
    cov_m =

        9.3333   18.0000
       18.0000    9.3333
  2. A different story is the one in which numpy and the others (pandas, R, Matlab) diverge: the number of elements each uses to estimate the mean of every column. This raises concerns about estimation error and biased estimates (and we have seen above that numpy's approach loses the positive semi-definiteness property as well); the sketch below contrasts the two strategies on the simple example above.

It is not clear to me, at this moment, which implementation is more reasonable, since numpy's may be more precise.
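To make the difference concrete, here is a minimal sketch of the two mean-estimation strategies applied to the 4x2 example above (variable names are mine):

import numpy as np

A = np.array([[1, 2], [np.nan, 4], [5, np.nan], [7, 8]], dtype=float)
x, y = A[:, 0], A[:, 1]
common = ~np.isnan(x) & ~np.isnan(y)  # rows where both columns are present

# pandas / R / Matlab 'partialrows': means taken over the pairwise-complete rows
xc = x[common] - x[common].mean()
yc = y[common] - y[common].mean()
print(np.sum(xc * yc) / (common.sum() - 1))  # 18.0

# numpy (np.ma.cov): means taken over each column's own non-NaN entries
xm = x[common] - np.nanmean(x)
ym = y[common] - np.nanmean(y)
print(np.sum(xm * ym) / (common.sum() - 1))  # 17.777...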

chris-b1 commented 7 years ago

Thanks - not something I'm deeply knowledgeable about, but at minimum would definitely take some expanded docs warning that cov in the presence of missing data should be interpreted carefully.
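In the meantime, a cheap sanity check (a sketch, not an official pandas recipe) is to look at the smallest eigenvalue of the result before feeding it to anything that assumes positive semi-definiteness:

import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 2], [None, 4], [5, None], [7, 8]])
cov_pd = A.cov()

# the matrix is symmetric, so eigvalsh applies; a negative eigenvalue
# means the pairwise-deletion estimate is not positive semi-definite
if np.linalg.eigvalsh(cov_pd.values).min() < 0:
    print("warning: covariance matrix is not positive semi-definite")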