**Open** · Wrzlprmft opened this issue 4 years ago
The bug in `ma.corrcoef` also occurs in the related `ma.cov`. Both involve calculating pairwise covariance terms between 1-D arrays `a` and `b` of the form `np.sum((a - a.mean()) * (b - b.mean())) / n`. The difference between `pandas` and `ma` is the use of different means. Both `a` and `b` can be masked, and the covariance term involving `a` and `b` effectively uses a mask that is the logical OR of `a`'s and `b`'s masks, so more array elements may be masked in the pairwise covariance term calculation than in the original arrays supplied to the function. `ma` uses the means of the arrays supplied to the function, whereas `pandas` uses the more correct means of the combined-mask arrays.
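To make the difference concrete, here is a minimal sketch showing how the two mean choices diverge for a single pair of masked 1-D arrays (the arrays here are illustrative, taken as the two rows of the example data):

```python
import numpy as np

a = np.ma.masked_invalid([np.nan, 6.0, 3.0, 4.0])
b = np.ma.masked_invalid([1.0, np.nan, 9.0, 16.0])

# Combined mask: an element is dropped if it is masked in *either* array.
keep = ~np.logical_or(a.mask, b.mask)

# ma.cov-style means, each taken over that array's own unmasked elements:
print(a.mean(), b.mean())              # 13/3 and 26/3, over 3 elements each

# pandas-style means, taken over only the jointly unmasked elements:
print(a[keep].mean(), b[keep].mean())  # 3.5 and 12.5, over the 2 shared elements
```

Once the means differ, every deviation `(a - mean_a)` in the covariance sum differs too, which is where the two results part ways.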
Here is a modified version of the OP's example using `cov` (which is simpler than `corrcoef`), simpler data, and a function that reproduces both the `ma` and `pandas` results depending on the value of the kwarg `original_mask`.
```python
import numpy as np
import pandas as pd

def solve_ma(data):
    return np.ma.cov(data)

def solve_pandas(data):
    return pd.DataFrame(data).T.cov().values

def solve_manual(data, original_mask=True):
    n = len(data)
    cov = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # Combined mask: keep only elements unmasked in both rows.
            keep = ~np.logical_or(data.mask[i], data.mask[j])
            a = data[i][keep]
            b = data[j][keep]
            # original_mask=True: ma-style means over each full row;
            # original_mask=False: pandas-style means over the combined mask.
            mean_a = np.ma.mean(data[i] if original_mask else a)
            mean_b = np.ma.mean(data[j] if original_mask else b)
            cov_ab = np.sum((a - mean_a) * (b - mean_b)) / (len(a) - 1)
            cov[i, j] = cov_ab
    return cov

def arr2str(array):
    return np.round(array, 4).tolist()

data = np.ma.masked_invalid([[np.nan, 6, 3, 4], [1, np.nan, 9, 16]])
print('pandas       ', arr2str(solve_pandas(data)))
print('manual(False)', arr2str(solve_manual(data, original_mask=False)))
print('numpy.ma     ', arr2str(solve_ma(data)))
print('manual(True) ', arr2str(solve_manual(data, original_mask=True)))
```
It produces this output:

```
pandas        [[2.3333, 3.5], [3.5, 56.3333]]
manual(False) [[2.3333, 3.5], [3.5, 56.3333]]
numpy.ma      [[2.3333, -2.8889], [-2.8889, 56.3333]]
manual(True)  [[2.3333, -2.8889], [-2.8889, 56.3333]]
```
so `original_mask=False` reproduces the `pandas` results and `original_mask=True` reproduces the `ma` results.
To me it is not clear which method is better from a mathematical point of view. You could argue that the "wrong" method uses more data to estimate the mean and is therefore better.
I had a look at the R documentation. R implements three main strategies for handling missing data: producing an NA output, calculating with complete samples only, and calculating with all complete pairs in each sample. As I understand it, the means are then calculated over complete samples or complete pairs only. These come in variants which control the output; for example, instead of an NA output it is also possible to raise an error.
The behavior in R is controlled by a keyword argument `use`, which can take several values such as `"everything"`, `"complete.obs"`, and `"pairwise.complete.obs"`.
So there are some design questions to solve here. Should the behavior be switchable? Should we ignore the partially incomplete points when calculating a mean?
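As a rough sketch of what a switchable behavior could look like, here is a hypothetical covariance function with a `use` kwarg modeled on R's. The function name, the keyword, and its accepted values are assumptions for illustration, not an existing NumPy API:

```python
import numpy as np

def masked_cov(data, use="pairwise.complete.obs"):
    """Sketch of a switchable covariance for 2-D masked data (rows = variables).

    `use` mimics R's cov() keyword (hypothetical here):
      - "everything":            any masked element makes the output entry masked,
      - "complete.obs":          first drop columns containing any masked entry,
      - "pairwise.complete.obs": per-pair combined mask with pairwise means.
    """
    data = np.ma.asarray(data)
    mask = np.ma.getmaskarray(data)
    n = len(data)
    if use == "complete.obs":
        # Keep only fully observed columns, then use the ordinary covariance.
        complete = data[:, ~mask.any(axis=0)]
        return np.cov(np.asarray(complete))
    cov = np.ma.masked_all((n, n))
    for i in range(n):
        for j in range(n):
            keep = ~(mask[i] | mask[j])
            if use == "everything" and not keep.all():
                continue  # leave cov[i, j] masked, analogous to R's NA
            a = np.asarray(data[i])[keep]
            b = np.asarray(data[j])[keep]
            # Pairwise means over the combined mask (the pandas-style choice).
            cov[i, j] = np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)
    return cov
```

With the example data above, the default `"pairwise.complete.obs"` reproduces the pandas result, while `"everything"` masks every entry whose pair has any missing element, so the question of which mean to use only arises in the pairwise mode.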
In the following code, I calculate the correlation coefficients of an array with missing data in three ways:

1. `ma.corrcoef`,
2. the `.corr` method of Pandas dataframes,
3. applying `np.corrcoef` to manually masked arrays.

The results of the last two methods agree, which suggests that they are correct. The result using masked arrays does not match, suggesting there is something wrong with it. Moreover, the computation of `ma.corrcoef` takes excessively long for larger arrays.

Version information:

```
NumPy: 1.18.1
Python: 3.7.5 (default, Nov 20 2019, 09:21:52) [GCC 9.2.1 20191008]
```