Work out a better model and include all towns

ramanshah commented 4 years ago

The posterior changes dramatically from 0 deaths to <5 deaths, particularly for the tiny towns that are most likely to have these death counts. When there are zero deaths, the posterior looks insanely different than any of the others. I cut off the smallest towns to make the dashboard for the README, but avoiding such ad-hocery is exactly why one goes Bayesian in the first place. The current dashboard snapshot is incomplete and unsatisfying.

Research the canonical way to judge the quality of these intervals (likely through cross-validation and coverage testing). Use this work to do some more model development for the interval construction. I may have to let go of the simplistic Beta paradigm to do a good job of dealing with the data suppression. For example, a Poisson likelihood, unlike the current Beta approach, would give the probability of a <5 observation directly.
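A minimal sketch of what "the likelihood of a <5 directly" could look like under a Poisson model. It assumes the suppressed symbol means 1 through 4 deaths (zero being reported separately), and the function names and the example town size are hypothetical:

```python
import math

def poisson_pmf(k, mu):
    """Probability of exactly k events for a Poisson with mean mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def suppressed_likelihood(rate_per_100k, population, low=1, high=4):
    """P(low <= deaths <= high | rate): likelihood of a '<5' cell.

    Assumption: '<5' stands for counts low..high (default 1..4).
    """
    mu = rate_per_100k * population / 100_000  # expected deaths in the town
    return sum(poisson_pmf(k, mu) for k in range(low, high + 1))

def zero_likelihood(rate_per_100k, population):
    """P(deaths == 0 | rate), for towns that report an explicit zero."""
    return math.exp(-rate_per_100k * population / 100_000)

# Hypothetical town of 1,000 people at the national rate of 14.3 per 100k
print(suppressed_likelihood(14.3, 1_000), zero_likelihood(14.3, 1_000))
```

Because the suppressed cell enters as a sum of ordinary Poisson terms, no imputed count is needed; the same function can be evaluated on a grid of rates to build a posterior.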

Consider borrowing strength among years (these are in the dashboards but I'd have to change the scraping/ETL) or among cities (such as with an Empirical Bayes prior specification) to improve the intervals.

The goal should be that going up from 0 to <5 doesn't "violently" change the posterior interval.

ramanshah commented 4 years ago

I wonder if the world needs an efficient HPD solver for numerical versions of this problem. In recent weeks I've daydreamed about a different model (something like a lognormal prior, giving 1-2 orders of magnitude of leeway around the national overdose death rate of 14.3 deaths/100k, with a Poisson likelihood). Getting away from a nice conjugate prior family quickly makes things hairy. Gelman in a recent blog post outlined minimum-length intervals.
How about Poisson-Gamma math? Can I work the <5 data into this exactly?
The biggest problem here is a bad prior. The flat Beta(1, 1) prior assigns 90% of the probability mass to at least 10% of the population dying of an overdose (>=10000 deaths per 100k), which is ridiculous. Choosing Beta(a, b) with a and b set to concentrate the prior on a plausible range of death rates would fix the "violent" changes in the posterior interval, including the zero-event towns whose highest density sits at zero and New Shoreham's absurd right endpoint.
Borrowing strength through, e.g., hierarchical models moves away from the "dashboardy" spirit of this work. It's a big step away from business intelligence and toward bespoke one-off statistical analysis.
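As a rough illustration of the numerical route mentioned above (lognormal prior, Poisson likelihood, HPD interval), here is a brute-force grid sketch. It is not an efficient solver. The prior width (one standard deviation equal to a factor of 10), the grid step, the rate cutoff, and the town size are all assumptions chosen for illustration, and `counts` lists the death counts consistent with the observation (so `[0]` for a zero and `[1, 2, 3, 4]` for an assumed reading of <5):

```python
import math

def lognormal_pdf(x, median, sigma):
    """Lognormal density in x, parameterized by its median and log-scale sd."""
    z = (math.log(x) - math.log(median)) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

def hpd_interval(counts, population, median=14.3, sigma=math.log(10),
                 mass=0.95, step=0.05, max_rate=2000.0):
    """Approximate HPD interval for the death rate per 100k on a uniform grid.

    Posterior mass above max_rate is ignored (an assumption of this sketch).
    """
    n = int(max_rate / step)
    rates = [(i + 0.5) * step for i in range(n)]  # cell midpoints
    post = []
    for r in rates:
        mu = r * population / 100_000
        like = sum(poisson_pmf(k, mu) for k in counts)
        post.append(lognormal_pdf(r, median, sigma) * like)
    total = sum(post)
    # Add cells from highest to lowest density until `mass` is covered
    order = sorted(range(n), key=lambda i: post[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(i)
        acc += post[i] / total
        if acc >= mass:
            break
    # Report the hull of the kept cells; exact only if the posterior is unimodal
    return rates[min(kept)] - step / 2, rates[max(kept)] + step / 2

# Hypothetical town of 1,000 people: explicit zero vs. a suppressed cell
print(hpd_interval([0], 1_000))
print(hpd_interval([1, 2, 3, 4], 1_000))
```

Note that an HPD interval depends on the parameterization: this sketch works in rate space, and the interval would differ if the density were taken over log-rate instead.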

ramanshah commented 3 years ago

Reviewing what I've done since creating this issue, I have an informative Beta prior that helps the results not be quite so ridiculous. I'd need to thread this back through Tableau Public and pull a fresh screenshot that includes all towns.
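To make the difference between a flat and an informative prior concrete, the tail mass can be checked in closed form for the Beta(1, b) family, where P(p >= x) = (1 - x)^b. The informative prior actually used is not stated in this thread, so the Beta(1, b) below, with its mean matched to 14.3 deaths per 100k, is purely a hypothetical stand-in:

```python
def beta1b_tail(x, b):
    """P(p >= x) under a Beta(1, b) prior; closed form (1 - x)**b."""
    return (1.0 - x) ** b

# Flat prior Beta(1, 1): mass on at least 10% of residents dying
print(beta1b_tail(0.1, 1.0))  # 0.9, matching the 90% figure

# Hypothetical informative prior: mean 1 / (1 + b) = 14.3 per 100k
b_informative = 1.0 / 14.3e-5 - 1.0
print(beta1b_tail(0.1, b_informative))    # mass on >= 10,000 per 100k
print(beta1b_tail(0.001, b_informative))  # mass on >= 100 per 100k
```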

The idea to use a Poisson likelihood (either with a Gamma prior or some non-conjugate prior with numerical solution) remains sensible, even after reading much of McElreath. It remains attractive to be able to use the <5 symbol directly. A key test case in the sort is New Shoreham/Block Island, which is tiny and shows <5. I'd guess that in such a tiny population, the number of deaths was probably 1. Imputing the symbol as 2.5, as we're now doing, overstates the overdose problem on Block Island.
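With a Gamma(a, b) prior on the per-person rate and a town of N people, a suppressed cell can be handled exactly if <5 is read as 1 through 4 deaths (an assumption). Summing the Poisson likelihood over those counts gives a posterior that is a mixture of Gamma(a + k, b + N) components, with the weight of count k proportional to Γ(a + k) / k! · (N / (b + N))^k. Those weights are also the posterior probabilities of each true count. A sketch, where the prior values and the population of 1,000 are hypothetical:

```python
import math

def suppressed_posterior(a, b, population, low=1, high=4):
    """Posterior for a '<5' cell under Gamma(a, b) prior and Poisson counts.

    Returns ({k: weight}, posterior mean rate per person). The posterior is
    a mixture of Gamma(a + k, b + population) with these weights.
    """
    ks = list(range(low, high + 1))
    r = population / (b + population)
    # Log weights up to a shared constant, computed stably with lgamma
    logw = [math.lgamma(a + k) - math.lgamma(k + 1) + k * math.log(r)
            for k in ks]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    w = [v / total for v in w]
    mean = sum(wk * (a + k) / (b + population) for wk, k in zip(w, ks))
    return dict(zip(ks, w)), mean

# Hypothetical prior: a = 1 with prior mean a / b = 14.3 per 100k
a, b = 1.0, 1.0 / 14.3e-5
weights, mean_rate = suppressed_posterior(a, b, population=1_000)
print(weights)                # posterior probability of each true count
print(mean_rate * 100_000)    # posterior mean rate per 100k
```

Under these illustrative numbers most of the weight lands on a single death, which is in line with the guess above that the true count in a tiny town was probably 1.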

ramanshah commented 3 years ago

Data is now much better, with a tidy table including year-over-year history instead of a single year scraped from PDFs:

https://ridoh-overdose-surveillance-rihealth.hub.arcgis.com/datasets/municipal-count-of-opioid-involved-fatal-overdose-by-year-resident-municipality/explore

Pushed to branch use_history.

Using this to the fullest would involve a Poisson model and an assumption of slow time variation on the event rate parameter. Possibly even a constant event rate parameter or some treatment of over-dispersion?
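A minimal sketch of the constant-rate version, assuming a Gamma(a, b) prior on the per-person rate and fully observed yearly counts (suppressed years are not handled here). Under those assumptions the years pool into a single Gamma(a + Σy, b + ΣN) posterior, and the variance-to-mean ratio of the counts gives a crude over-dispersion check when the population is roughly constant. The counts, population, and prior values below are made up for illustration:

```python
def pooled_gamma_posterior(counts, populations, a, b):
    """Constant rate across years: posterior is Gamma(a + sum(y), b + sum(N))."""
    return a + sum(counts), b + sum(populations)

def dispersion_ratio(counts):
    """Sample variance divided by mean of the yearly counts.

    Near 1 is consistent with a constant-rate Poisson (given roughly constant
    population); well above 1 hints at over-dispersion or a drifting rate.
    """
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# Made-up five-year history for a town of 30,000 people
counts = [7, 9, 6, 12, 8]
populations = [30_000] * 5
a_post, b_post = pooled_gamma_posterior(counts, populations, a=1.0, b=1.0 / 14.3e-5)
print(a_post / b_post * 100_000)  # posterior mean rate per 100k
print(dispersion_ratio(counts))
```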

ramanshahdatascience / interval_sorting_demo

Work out a better model and include all towns #1