strawlab / best

Bayesian estimation supersedes the t test
MIT License
145 stars 37 forks source link

calculating HDI #1

Open clerian opened 8 years ago

clerian commented 8 years ago

Hello,

in the function hdi_of_mcmc, could it be that the standard configuration actually calculates a slightly off HDI?

let's assume we want to calculate the 0.95 interval, then the two lines would subtract the remaining 5% two times and thus calculating the 0.9 interval?

ci_idx_inc = int(np.floor( cred_mass*len(sorted_pts) ))
n_cis = len(sorted_pts) - ci_idx_inc

or am I mistaken somewhere?

Best regards, Clemens Sauerzopf

treszkai commented 5 years ago

@clerian

Short answer: the code is correct.

>>> hdi_of_mcmc(st.norm(0,1).rvs(1_000_000))
(-1.9924611632940947, 1.9300090335912563)

>>> import scipy.stats as st
>>> st.norm(0,1).ppf([0.025, 0.975])
array([-1.95996398,  1.95996398])

>>> st.norm(0,1).ppf([0.05, 0.95])
array([-1.64485363,  1.64485363])

An annotated version of the function (comments mine):

def hdi_of_mcmc( sample_vec, cred_mass = 0.95 ):
    # sort the samples
    sorted_pts = np.sort( sample_vec )

    # the credible interval (CrI) contains ci_idx_inc samples
    ci_idx_inc = int(np.floor( cred_mass*len(sorted_pts) ))

    # the CrI excludes n_cis samples
    n_cis = len(sorted_pts) - ci_idx_inc

    # ci_width contains the widths of all candidate CrIs,
    #  i.e. the distance between the i-th and (95%+i)-th sample
    ci_width = sorted_pts[ci_idx_inc:] - sorted_pts[:n_cis]

    # The shortest interval in ci_width is the one with highest density,
    #  and its indexed by min_idx...
    min_idx = np.argmin(ci_width)

    # ...which means the HDI starts at hdi_min...
    hdi_min = sorted_pts[min_idx]
    # ...and lasts until ci_idx_inc samples later.
    hdi_max = sorted_pts[min_idx+ci_idx_inc]
    return hdi_min, hdi_max