numpy / numpy

The fundamental package for scientific computing with Python.
https://numpy.org

BUG: ma.corrcoef produces wrong results and takes very long. #15601

Open Wrzlprmft opened 4 years ago

Wrzlprmft commented 4 years ago

In the following code, I calculate the correlation coefficients of an array with missing data in three ways: with np.ma.corrcoef on a masked array, with pandas, and manually over the pairwise-complete values.

The results of the last two methods agree, which suggests that they are correct. The result using masked arrays does not match them, suggesting that something is wrong with it. Moreover, the computation with np.ma.corrcoef takes excessively long for larger arrays.

import numpy as np
from pandas import DataFrame

# Three implementations of the correlation coefficient for missing data

cc_masked = lambda data: np.ma.corrcoef(np.ma.masked_invalid(data))

cc_pandas = lambda data: DataFrame(data).T.corr().values

def cc_manual(data):
    n = len(data)
    correlation_matrix = np.empty((n,n))
    for i in range(n):
        for j in range(n):
            mask = np.isfinite(data[i]) & np.isfinite(data[j])
            corr = np.corrcoef( data[i][mask], data[j][mask] )
            correlation_matrix[i,j] = corr[0,1]
    return correlation_matrix

# Generate test data
n = 42
m = 137
q = 0.07

R = np.random.uniform(-1,1,(n,n))
C = R @ R.T
data = np.random.multivariate_normal(np.zeros(n),C,m).T
nan_mask = np.random.choice( [True,False], size=data.shape, p=[q,1-q] )
data[nan_mask] = np.nan

# Compare correlation coefficients
np.testing.assert_almost_equal( cc_manual(data), cc_pandas(data) )  # passes
np.testing.assert_almost_equal( cc_manual(data), cc_masked(data) )  # fails

Version information:

NumPy: 1.18.1
Python: 3.7.5 (default, Nov 20 2019, 09:21:52) [GCC 9.2.1 20191008]

ianthomas23 commented 3 years ago

The bug in ma.corrcoef also occurs in the related ma.cov. Both involve calculating pairwise covariance terms between 1-D arrays a and b of the form np.sum((a - a.mean()) * (b - b.mean())) / n. The difference between pandas and ma is the use of different means. Both a and b can be masked, and the covariance term involving a and b effectively uses a mask that is the logical or of a's and b's masks, so more array elements may be masked in the pairwise covariance term calculation than in the original arrays supplied to the function. ma uses the means of the arrays supplied to the function, whereas pandas uses the more correct means of the combined-mask arrays.

Here is a modified version of the OP's example using cov (which is simpler than corrcoef), simpler data, and a function that reproduces both the ma and pandas results depending on the value of the kwarg original_mask.

import numpy as np
import pandas as pd

def solve_ma(data):
    return np.ma.cov(data)

def solve_pandas(data):
    return pd.DataFrame(data).T.cov().values

def solve_manual(data, original_mask=True):
    n = len(data)
    cov = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            keep = ~np.logical_or(data.mask[i], data.mask[j])
            a = data[i][keep]
            b = data[j][keep]
            mean_a = np.ma.mean(data[i] if original_mask else a)
            mean_b = np.ma.mean(data[j] if original_mask else b)
            cov_ab = np.sum((a-mean_a)*(b-mean_b)) / (len(a)-1)
            cov[i, j] = cov_ab
    return cov

def arr2str(array):
    return np.round(array, 4).tolist()

data = np.ma.masked_invalid([[np.nan, 6, 3, 4], [1, np.nan, 9, 16]])

print('pandas       ', arr2str(solve_pandas(data)))
print('manual(False)', arr2str(solve_manual(data, original_mask=False)))
print('numpy.ma     ', arr2str(solve_ma(data)))
print('manual(True) ', arr2str(solve_manual(data, original_mask=True)))

It produces this output:

pandas        [[2.3333, 3.5], [3.5, 56.3333]]
manual(False) [[2.3333, 3.5], [3.5, 56.3333]]
numpy.ma      [[2.3333, -2.8889], [-2.8889, 56.3333]]
manual(True)  [[2.3333, -2.8889], [-2.8889, 56.3333]]

so original_mask=False reproduces the pandas results and original_mask=True reproduces the ma results.
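To make the difference in the off-diagonal term concrete: the only jointly unmasked columns of this data are the last two, so the pairwise term uses a = [3, 4] and b = [9, 16], divided by len(a) - 1 = 1.

With the pairwise means (pandas): mean_a = 3.5, mean_b = 12.5, so cov_ab = (3 - 3.5)(9 - 12.5) + (4 - 3.5)(16 - 12.5) = 1.75 + 1.75 = 3.5.

With the original-mask means (ma): mean_a = 13/3, mean_b = 26/3, so cov_ab = (3 - 13/3)(9 - 26/3) + (4 - 13/3)(16 - 26/3) = -4/9 - 22/9 = -26/9 ≈ -2.8889.

These match the off-diagonal entries printed above.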

RonaldAJ commented 1 year ago

To me it is not clear which method is better from a mathematical point of view. You could argue that the "wrong" method uses more data to estimate the mean and is therefore better.

I had a look at the R documentation. R implements three main strategies for handling missing data: producing NA output, calculating with complete samples only, and calculating with all complete pairs in each sample. As I understand it, the means are then calculated over complete samples or complete pairs only. These strategies come in variants which control the output; for example, instead of producing NA output it is also possible to raise an error.

The behavior in R is controlled by a keyword argument "use", which can take several values such as "everything", "complete.obs", and "pairwise.complete.obs".

So there are some design questions to solve here. Should the behavior be switchable? Should we ignore the partially incomplete points when calculating a mean?
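Just to make the design question tangible, here is a rough sketch of what a switchable behavior along the lines of R's "use" argument could look like for cov. The name masked_cov and the strategy keyword are made up for illustration, not an existing NumPy API, and there is no guard for pairs with fewer than two overlapping samples.

import numpy as np

def masked_cov(data, strategy="pairwise.complete.obs"):
    # Hypothetical helper, not an existing NumPy function.
    # strategy="complete.obs": drop every column that is masked in any row.
    # strategy="pairwise.complete.obs": per pair of rows, keep only jointly
    # unmasked columns and recompute the means on those columns.
    data = np.ma.asarray(data)
    mask = np.ma.getmaskarray(data)
    if strategy == "complete.obs":
        keep = ~mask.any(axis=0)
        return np.cov(np.asarray(data[:, keep]))
    n = len(data)
    cov = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            keep = ~(mask[i] | mask[j])
            a = np.asarray(data[i])[keep]
            b = np.asarray(data[j])[keep]
            cov[i, j] = np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)
    return cov

On the example data above, the pairwise strategy reproduces the pandas result; keeping the means of the originally supplied arrays instead (as ma does now) gives the current ma result.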