k-sys / covid-19

A collection of work related to COVID-19
1.38k stars 434 forks source link

`highest_density_interval` fails on the widest interval #29

Open gkossakowski opened 4 years ago

gkossakowski commented 4 years ago

The highest_density_interval fails to handle the case when the interval to return should be the whole range.

E.g.

foo = pd.Series([1/2, 1/3, 1/6])
highest_density_interval(foo, p=0.9)

fails. The problem is that total_p doesn't have an element corresponding to the total sum of all probabilities. It's due to a vectorized version of off-by-one error.

I'll send a PR with a fix.

kmedved commented 4 years ago

I'd be interested in this fix, as I've been having issues with this function with certain data configurations.

dsfulf commented 4 years ago

I fixed this issue in #14, which was closed and not merged. It returns a default interval of (0, len(n) - 1), and is also quite a bit faster than the current HDI function.

For fun I also added the map-tiled plot:

`map plot`

LakshMatai commented 4 years ago

I don't know if it's fixed or not yet, I was still facing this issue in my code. I created a PR for it https://github.com/k-sys/covid-19/pull/56

eye4got commented 1 month ago

For anyone still interested, neither of the above results yield the "highest density" credible interval, since they return the entire interval (which misses the point of the credible interval). The current highest_density_interval() fails occasionally because it excludes the probability that Rt=0 from any generated credible interval. This occurs because the generated cdf starts at P(Rt = 0), rather than 0. So when P(Rt = 0) > (1-p), total_p has no values greater than p. Thus we see "ValueError: attempt to get argmin of an empty sequence".

Changing the definition of cumsum within highest_density_interval() as follows fixes the issue:

cumsum = np.zeros(pmf.shape[0] + 1)  
cumsum[1:] = pmf.cumsum()  

You'll then also need to the definition of high within the same function, to catch the IndexError we've now created at the opposite end of the vector.

high = pmf.index[highs[best]] if highs[best] < pmf.shape[0] else 1

Results of the function for the practice example had the lower bound increased by 0.01 and the upper bound increased by 0.01. I believe this is because by shifting the cdf but not the indices (e.g. lows[best]), we are effectively including the probability of the lowest value of Rt within the range.