parksw3 / epidist-paper


Non-uniform delays between discrete and continuous time events #21

Closed parksw3 closed 1 year ago

parksw3 commented 1 year ago

See test_censor.R

Consider a case in which the incidence is growing at rate $r$ between time 0 and $t$, such that $i(t) = i(0) \exp(rt)$. Then, the distribution of event times between 0 and $t$ is proportional to $i(t)$. In this case, the mean event time is

$$ \frac{\exp(rt) (rt - 1) + 1}{r (\exp(rt)-1)}. $$
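For completeness, this follows from normalising the growth curve into a density over $[0, t]$ and integrating by parts:

```latex
% Event-time density proportional to i(s) = i(0) e^{rs} on [0, t]:
f(s) = \frac{r e^{rs}}{e^{rt} - 1}, \qquad 0 \le s \le t.
% Mean event time:
\mathbb{E}[S] = \int_0^t s\, f(s)\, ds
             = \frac{1}{e^{rt}-1}\left[ e^{rs}\left(s - \frac{1}{r}\right) \right]_0^t
             = \frac{e^{rt}(rt - 1) + 1}{r\,(e^{rt} - 1)}.
```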

One thing to note is that this depends on $t$. So if we're dealing with a daily scale ($t = 1$), assuming $r = 0.2$, we get 0.52, which is very close to 0.5. So we're not very far from the uniform distribution.
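As a quick numeric check (a Python sketch for illustration; the repo's analysis is in R), evaluating the expression above at $r = 0.2$, $t = 1$:

```python
import math

def mean_event_time(r, t):
    """Mean event time on [0, t] when incidence grows as exp(r * s)."""
    return (math.exp(r * t) * (r * t - 1) + 1) / (r * (math.exp(r * t) - 1))

# Daily scale (t = 1) with growth rate r = 0.2: close to the uniform mean of 0.5.
print(round(mean_event_time(0.2, 1), 3))  # → 0.517
```

Flipping the sign of $r$ gives the mirror image (a mean below 0.5), which is the decay-phase behaviour discussed below.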

For example, consider a scenario in which the primary event is increasing at rate $r = 0.2$. Then, the corresponding secondary events from this cohort initially increase at the same rate and then decrease exponentially.

Rplot

If we look at the difference between continuous and discrete event time for both ptime and stime, they seem to match up with what we predict during the growth phase.

Rplot01

And during the "decay" phase, we see that the differences are lower than 0.5 for stime.
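The within-day offset can be checked directly by sampling event times whose density is tilted by $\exp(rs)$ within a day (a hypothetical Python sketch using inverse-CDF sampling on $[0, 1]$; the repo itself uses R):

```python
import math
import random

def sample_event_offset(r, rng):
    """Sample a within-day event time with density proportional to exp(r * s) on [0, 1]."""
    u = rng.random()
    return math.log(1 + u * (math.exp(r) - 1)) / r

rng = random.Random(1)
growth = [sample_event_offset(0.2, rng) for _ in range(100_000)]
decay = [sample_event_offset(-0.2, rng) for _ in range(100_000)]

# Growth phase: mean offset slightly above 0.5; decay phase: slightly below.
print(round(sum(growth) / len(growth), 2), round(sum(decay) / len(decay), 2))
```

The sampled means land near 0.52 and 0.48, matching the analytic formula in the two phases.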

Now, consider a case in which we've truncated the data so that both primary and secondary events are growing exponentially. This is also more realistic and closer to the data we're getting.

Rplot02

In this case, our predictions match up for the most part, except for ptime near the end. Since truncation happens based on the secondary event, the incidence of the primary event starts shrinking rapidly near the truncation time:

Rplot03

This translates to a low within-day mean event time for the primary event near the truncation time. So if we tried to fit the uniform distribution, we might do OK for the most part, except near the end.

Now, this is a problem that's going to affect the uniform case as well. Even if the incidence of the primary event (e.g., infection) is stable ($r=0$), if we truncate based on the secondary event (e.g., symptom onset), the incidence of the primary event is going to decrease near the truncation time, leading to the biases we see here:

Rplot04

In this case, we're also seeing some left truncation happening in the secondary event time. So, for example, if we took everyone who got infected and developed symptoms in November, this is the kind of pattern we'd see even if the growth rate for the entire year were stable. It might be tempting to filter based on ptime to keep the incidence of ptime stable, but then the incidence of stime will not be stable.
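This effect can be illustrated with a small simulation (a Python sketch with hypothetical parameter choices; the repo uses R): stable primary incidence, lognormal delays, and right truncation on the secondary event.

```python
import random

rng = random.Random(42)
T = 60  # truncation time

# Stable primary incidence (r = 0) over [0, T], lognormal delays.
ptimes = [rng.uniform(0, T) for _ in range(50_000)]
stimes = [p + rng.lognormvariate(1.0, 0.5) for p in ptimes]

# Truncate on the secondary event, as in the observed data.
observed = [p for p, s in zip(ptimes, stimes) if s <= T]

# Observed primary incidence per day: flat in the middle, collapsing near T.
mid = sum(1 for p in observed if 29 <= p < 30)
last = sum(1 for p in observed if T - 1 <= p < T)
print(mid, last)
```

Even with a flat true $i_p(t)$, the observed primary incidence in the final day is a small fraction of the mid-period level, because only the shortest delays survive the truncation.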

Bottom line is that no matter what cohort we pick, $i_p(t)$ and $i_s(t)$ can't both have the same qualitative incidence pattern (e.g., both stable), which can cause the distribution of the differences between the continuous and discrete event times to deviate systematically from the uniform distribution (especially for cohorts near the edges).

How much does this matter for the overall inference? Harder to say... it depends a lot on the distribution itself too (it matters more for shorter delay distributions).

parksw3 commented 1 year ago

20 simulations of 200 samples each with uniform cases, fitting the latent variable model 20 times.

Rplot05

90% coverage for meanlog, 100% coverage for sdlog. 95% coverage for mean. 100% coverage for sd. Not bad at all after all...

seabbs commented 1 year ago

This is very cool. Essentially we are getting the same truncation effects globally as within a day, which makes sense. I played around with different growth rates and distributions; it becomes more or less of an issue as you would expect (we should probably show this).

So is this the source of the negative correlation observed in the latent model (i.e. #17)?

This feels like it can be a second section. So we lead with a comparison of methods and then bring this in for the "best" method and explore in some detail?

Was there any evidence of bias across multiple fits? Visually it looks like it's perhaps over-estimating the mean.

seabbs commented 1 year ago

Just had a quick look at bias here.

https://github.com/parksw3/dynamicaltruncation/blob/main/scripts/test_uniform.R

Looks like some issues are present despite coverage being quite good

   parameter   mad  bias   dss  crps log_score ae_median se_mean
      <fctr> <num> <num> <num> <num>     <num>     <num>   <num>
1:   meanlog 0.042  0.42  -4.6 0.033     -1.30     0.044 0.00340
2:     sdlog 0.031 -0.36  -5.9 0.018     -2.00     0.026 0.00088
3:      mean 0.350  0.28  -1.0 0.220      0.44     0.300 0.18000
4:        sd 0.390 -0.11  -1.1 0.190      0.36     0.260 0.12000
parksw3 commented 1 year ago

What's the unit of bias here? Based on the figure, I can't figure out what 0.42 could mean for meanlog. I looked at https://cran.r-project.org/web/packages/scoringutils/scoringutils.pdf but am pretty confused... Would it be better to express it in terms of something like relative bias rather than quantile bias?

seabbs commented 1 year ago

Bias is as defined here

Screenshot 2022-11-08 at 13 35 08

I reran this and am getting a consistent bias towards under-prediction (maybe this implies I have introduced yet another bug).
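For sample-based forecasts, the scoringutils bias reduces (as I read the definition in the screenshot) to one minus twice the empirical CDF of the predictive samples at the true value, so it lives in $[-1, 1]$, with 0 meaning unbiased and positive values indicating over-prediction. A minimal Python sketch of that formula:

```python
def sample_bias(samples, true_value):
    """1 - 2 * empirical CDF at the true value: -1 (all under) to 1 (all over), 0 unbiased."""
    p = sum(s <= true_value for s in samples) / len(samples)
    return 1 - 2 * p

print(sample_bias([1.0, 2.0, 3.0, 4.0], 2.5))  # → 0.0 (half the samples below the truth)
print(sample_bias([3.0, 4.0, 5.0], 1.0))       # → 1.0 (all samples above: over-prediction)
```

On this reading, the earlier meanlog bias of 0.42 is not an error of 0.42 on the meanlog scale; it says the posterior samples sit above the true value more often than below across fits.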

r$> scores
   parameter   mad  bias   dss  crps log_score ae_median se_mean
      <fctr> <num> <num> <num> <num>     <num>     <num>   <num>
1:   meanlog 0.041 -0.13 -5.40 0.023     -1.80     0.034 0.00160
2:     sdlog 0.032 -0.11 -6.10 0.015     -2.10     0.019 0.00066
3:      mean 0.330 -0.13 -1.20 0.180      0.33     0.260 0.09800
4:        sd 0.390 -0.13 -0.76 0.190      0.46     0.240 0.11000

plot (2)

parksw3 commented 1 year ago

Very very weird. I tried it without truncation (when everything really should be uniform) and am getting a consistent overestimation. I wonder if this is because there isn't truncation but the truncation is still being estimated? Let me break this down a bit more.

FYI, this thing now runs super super fast so I feel like doing a full simulation shouldn't take long.

> scores
   parameter   mad  bias   dss   crps log_score ae_median se_mean
      <fctr> <num> <num> <num>  <num>     <num>     <num>   <num>
1:   meanlog 0.036 0.370  -5.9 0.0180    -2.000     0.028 0.00100
2:     sdlog 0.026 0.065  -6.9 0.0099    -2.500     0.014 0.00027
3:      mean 0.270 0.370  -1.9 0.1400    -0.042     0.200 0.05900
4:        sd 0.300 0.230  -1.9 0.1300    -0.048     0.170 0.05100

Rplot11

seabbs commented 1 year ago

I wonder if this is because there isn't truncation but the truncation is still being estimated? Let me break this down a bit more.

This sounds quite plausible.

I think we are covering this by looking at short delays and the residual bias our estimates still have in this setting (i.e. it falls in our current simulation grid?). I do think we may want some kind of figure to explain the divergence from the uniform we would expect in different growth settings, as a nice way of highlighting the issue. Flagging for discussion in the paper.

parksw3 commented 1 year ago

Sorry, I made too many issues. This is now discussed in https://github.com/parksw3/dynamicaltruncation/issues/27.

And yes, we both agreed that we wanted to add a figure to explain these features.