karimn / covid-19-transmission

0 stars 0 forks source link

Informative prior for imputed cases #16

Open karimn opened 4 years ago

karimn commented 4 years ago

@wwiecek found that we are hitting convergence problems because of non-identifiability of R0 and imputed cases, so setting a more informative (restrictive) prior on imputed cases eliminates some of these issues.

We also discussed using a prior for imputed cases that is conditional of the population size of sub-regions.

@wwiecek I'm opening this issue so we can keep track what we discussed as a solution. Please add anything you think is relevant here. I'm assigning to you, but feel free to re-assign back to me if you have a clear implementation of these priors you want me to do. I'm going to go ahead and do a cluster run with the new restricted priors you checked in.

wwiecek commented 4 years ago

I do not think I changed the defaults, only moved the fixed value of 0.03 into YAML? (Do not remember now.) Important to check before running.

I think we said the prior should have a mean in proportion to population size, but we just need to decide on proportionality constant and a distribution, right?

In any case let's cluster run in the meantime because it's free lunch. We may learn something. But I have to say that setting it higher may mess with some regions that should have high imputed cases. BTW I generated some values from this exp(1/exp(0.03)) (I think that's what they have?), it's narrower than I thought

y1 <- rexp(1e06, 0.03)
y2 <- sapply(y1, function(x) rexp(1, 1/x))
hist(y1)
hist(y2)
summary(y1)
summary(y2)
sum(y1 > 1000)
sum(y2 > 1000)

but obviously data will fight this prior so you may see high imputed cases often even if you set to something higher than 0.03

wwiecek commented 4 years ago

PS: So I made a back of envelope calculation. It seems that to record 10 deaths on day 30 you typically would need hundreds of cases in week 1 (so imputed_cases somewhere in 10-100 range). So I don't have a problem with their 0.03 value in principle, but we know that it is a problem.

To get us started, if you use cut-off of 3 deaths I'd multiply it by 3 to 0.09. And then multiply by 2 for a good measure just to see what happens. So let's say 0.18 instead of 0.03 for the test cluster run. If the results are still not great for some regions, I'd then re-run with 0.5 just to see what happens.

Basic calc is here for posterity (please disregard):

y <- 200
imputed_c <- rep(y, 7)
i <- c(imputed_c, 3*imputed_c, 9*imputed_c, 27*imputed_c)
ifr <- .0066
sum(i*ifr*sapply(28:1, function(i) pgamma(i, 17, .5)))
wwiecek commented 4 years ago

So, let's compare Poland fitted with low imputed cases (0.18, as disc. above) and fitted with 0.03, the original param

#PL with 0.18
> print(mob_fit, "imputed_cases")
Inference for Stan model: mobility.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean se_mean    sd  2.5%    25%    50%    75%  97.5% n_eff Rhat
imputed_cases[1]  210.67    1.61 72.91 94.80 158.72 200.50 252.45 377.27  2045 1.01
imputed_cases[2]   37.52    0.34 22.38  8.94  21.73  32.79  47.78  93.66  4407 1.00
imputed_cases[3]  178.82    1.53 61.26 84.39 135.13 170.93 213.78 324.61  1608 1.01
imputed_cases[4]  186.01   10.09 88.06 69.85 129.21 168.72 222.53 417.24    76 1.03
imputed_cases[5]  141.73    5.38 48.56 54.92 109.24 138.98 169.17 250.18    81 1.03
imputed_cases[6]  182.48    1.38 64.16 84.77 136.29 172.80 218.56 332.77  2152 1.01
imputed_cases[7]   89.70    0.57 35.00 36.92  65.25  84.00 108.44 173.97  3812 1.00
imputed_cases[8]   27.60    0.39 23.82  2.25  11.11  20.69  37.01  89.74  3641 1.00
imputed_cases[9]  208.86    1.24 75.30 92.69 153.49 197.64 252.31 382.82  3673 1.00
imputed_cases[10]  92.28    0.67 33.81 42.48  69.50  86.64 109.40 177.37  2553 1.00
imputed_cases[11] 111.66    0.92 57.34 33.77  71.15 101.26 140.27 248.97  3909 1.00
imputed_cases[12] 119.46    1.42 40.22 51.40  92.06 115.14 141.55 213.43   804 1.01
imputed_cases[13] 115.11    0.94 56.43 35.34  73.71 105.78 145.81 248.57  3620 1.00

##PL with 0.03
Samples were drawn using NUTS(diag_e) at Wed Jul  1 11:49:56 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> print(fit1, "imputed_cases")
Inference for Stan model: mobility.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean se_mean     sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
imputed_cases[1]  331.29   45.67 182.91 124.81 227.63 290.72 375.89 869.46    16 1.09
imputed_cases[2]   51.30    0.46  28.23  14.14  31.06  45.30  64.49 120.29  3743 1.00
imputed_cases[3]  246.89    1.27  81.53 124.25 188.96 235.11 290.87 439.49  4110 1.00
imputed_cases[4]  275.50   45.77 163.64  94.41 183.01 236.27 314.03 735.91    13 1.10
imputed_cases[5]  179.81    9.75  76.76  76.27 128.97 167.12 212.44 377.42    62 1.04
imputed_cases[6]  240.89    1.25  79.97 115.51 183.48 230.01 287.77 429.29  4065 1.00
imputed_cases[7]  128.48    0.68  46.35  56.95  95.69 122.09 154.69 235.19  4681 1.00
imputed_cases[8]   17.08    3.55  15.87   1.99   6.94  12.25  21.55  60.75    20 1.07
imputed_cases[9]  269.26    6.07 100.96 123.40 195.76 253.71 322.35 505.82   277 1.02
imputed_cases[10] 146.14   19.43  53.96  73.81 108.60 134.38 171.35 288.38     8 1.17
imputed_cases[11] 103.36   17.98  60.27  31.43  61.43  88.30 129.23 263.49    11 1.11
imputed_cases[12] 188.23   28.77  73.90  90.02 137.68 172.23 219.01 384.26     7 1.21
imputed_cases[13]  98.89   18.98  58.26  26.55  57.40  85.72 125.63 249.22     9 1.14

Samples were drawn using NUTS(diag_e) at Sat Jul  4 13:34:28 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

True, neither does converge, but you can see that the one with weaker right tail does better. To me this would indicate that 1) we should have these more restrictive priors and in short term work on making them suited to pop size 2) there may be another problem, not related to imputed_cases.

If I plotted log_R0, the problem is similar, you can basically see that inference gets "stuck" for some regions but not others. Same for many other parameters.

Next step: fit the bad regions and see what might be causing problems.

wwiecek commented 4 years ago

@karimn LMK what you think about the above -- I'm still working on slightly different things (vaccines) so probably won't do more thank thinker with it until next week

karimn commented 4 years ago

Sorry, let me ask a couple clarifying questions. It seems like we're dealing with a degenerate model that can't identify whether the imputed cases should be high or R0, right? We therefore use our informative priors based on expertise to require that the imputed cases should be lower, right?

wwiecek commented 4 years ago

Yes. The part of the model that has R0 and imputed_cases is not completely unidentified, it's just poorly identified, especially because R changes over time.