probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
Apache License 2.0

Pearson correlation probability might be incorrect (correlation_p_pearsonr2) #326

Closed jostheim closed 8 years ago

jostheim commented 8 years ago


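The formula from Wikipedia for the t statistic of a Pearson correlation coefficient r computed from n samples:

t = r\sqrt{\frac{n-2}{1 - r^2}}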

From the code for correlation_p_pearsonr2:

def correlation_pearsonr2(data0, data1):
    return stats.pearsonr(data0, data1)**2

def correlation_p_pearsonr2(data0, data1):
    correlation = correlation_pearsonr2(data0, data1)
    if math.isnan(correlation):
        return float('NaN')
    if correlation == 1.:
        return 0.
    n = len(data0)
    assert n == len(data1)
    # Compute observed t statistic.
    t = correlation * math.sqrt((n - 2)/(1 - correlation**2))
    # Compute p-value for two-sided t-test.
    return 2 * stats.t_cdf(-abs(t), n - 2)

The function correlation_pearsonr2 returns the square of the Pearson correlation coefficient (I checked this against scipy, whose result agrees with bayeslite's). But correlation_p_pearsonr2 then plugs that squared value into the t statistic and squares it again, so the term under the square root uses the correlation to the 4th power. I believe this is incorrect: the formula from Wikipedia expects the unsquared correlation coefficient r.

Fixing this would also explain why the scipy version of pearsonr gives the same correlation value but wildly different p-values. Of course, I might just have misunderstood what the code is doing.
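
To make the discrepancy concrete, here is a minimal sketch using only the Python standard library. It is not bayeslite code: the helper names pearson_r and t_statistic and the sample data are made up for illustration. It evaluates the t statistic once with the unsquared coefficient r, as the formula expects, and once with r^2 passed in its place, which is what the quoted function effectively does.

```python
import math

def pearson_r(xs, ys):
    """Plain (unsquared) Pearson correlation coefficient; hypothetical helper."""
    n = len(xs)
    assert n == len(ys)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), where r is the unsquared coefficient."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Made-up sample data, chosen only so that 0 < r < 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
n = len(xs)

r = pearson_r(xs, ys)
t_expected = t_statistic(r, n)       # formula applied to r
t_reported = t_statistic(r ** 2, n)  # r^2 substituted for r, as in the quoted code

print("r =", r)
print("t with r   =", t_expected)
print("t with r^2 =", t_reported)
```

For 0 < |r| < 1, substituting r^2 shrinks both the leading factor and the square-root term, so |t| comes out smaller and the two-sided p-value larger than the formula intends. That would be consistent with matching correlation values but diverging p-values.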

riastradh-probcomp commented 8 years ago

Fixed in 58f7fa8d880d82a38d77edcac33ac92688e50087.