probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

Pearson correlation probability might be incorrect (correlation_p_pearsonr2) #326

Closed jostheim closed 8 years ago

jostheim commented 8 years ago

From https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

t = r\sqrt{\frac{n-2}{1 - r^2}}

And from bayeslite's correlation_p_pearsonr2:

def correlation_pearsonr2(data0, data1):
    return stats.pearsonr(data0, data1)**2

def correlation_p_pearsonr2(data0, data1):
    correlation = correlation_pearsonr2(data0, data1)
    if math.isnan(correlation):
        return float('NaN')
    if correlation == 1.:
        return 0.
    n = len(data0)
    assert n == len(data1)
    # Compute observed t statistic.
    t = correlation * math.sqrt((n - 2)/(1 - correlation**2))
    # Compute p-value for two-sided t-test.
    return 2 * stats.t_cdf(-abs(t), n - 2)

The function correlation_pearsonr2 returns the square of the Pearson correlation coefficient (I checked this against scipy's pearsonr, and the values agree with bayeslite's pearsonr2), but correlation_p_pearsonr2 then squares that value again when computing the t statistic. The term under the square root therefore contains the correlation to the 4th power, and the leading factor is r^2 rather than r. I believe this is incorrect: the formula from Wikipedia expects the unsquared correlation coefficient.
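Spelled out, assuming the quoted code is read exactly as written, substituting r^2 for r in the Wikipedia formula gives the following. The name t_code is introduced here only for illustration and does not appear in bayeslite.

```latex
\begin{aligned}
t_{\text{code}} &= r^2 \sqrt{\frac{n-2}{1 - r^4}} \\
                &= \frac{r}{\sqrt{1 + r^2}} \cdot r\sqrt{\frac{n-2}{1 - r^2}} \\
                &= \frac{r}{\sqrt{1 + r^2}} \, t
\end{aligned}
```

Because |r| / \sqrt{1 + r^2} is always below 1, |t_code| is smaller than |t|, which would make the reported two-sided p-value larger than the one the formula intends.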

This would also explain why I found scipy's pearsonr to give the same correlation value but wildly different p-values. Of course, I might just have misunderstood what the code is doing.
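A minimal sketch of the discrepancy described above, using only the Python standard library. The helper names (pearson_r, t_from_r, t_as_quoted) and the sample data are made up for illustration and are not part of bayeslite or scipy. The sketch compares the t statistic from the Wikipedia formula with the one obtained when the already squared correlation is passed in.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r (unsquared)."""
    n = len(xs)
    assert n == len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic from the Wikipedia formula, using the unsquared r."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def t_as_quoted(r, n):
    """t statistic as the quoted code computes it: r**2 is used where
    the formula expects r, so the correlation is squared a second time."""
    r_squared = r ** 2
    return r_squared * math.sqrt((n - 2) / (1 - r_squared ** 2))

# Made-up illustrative data, not taken from the issue.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
n = len(xs)

r = pearson_r(xs, ys)
t_expected = t_from_r(r, n)
t_quoted = t_as_quoted(r, n)
print(r, t_expected, t_quoted)
```

Passing either value to a two-sided t-test with n - 2 degrees of freedom, as the quoted code does with stats.t_cdf, would then give different p-values, with the smaller |t| giving the larger p-value.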

riastradh-probcomp commented 8 years ago

Fixed in 58f7fa8d880d82a38d77edcac33ac92688e50087.