parksw3 / epidist-paper

Diagnosing case study #34

Closed parksw3 closed 1 year ago

parksw3 commented 1 year ago

Figures for the case study results look pretty wrong. For example, even for the naive case, we're getting huge differences in estimates between real-time and retrospective estimates at t=240.

[figure: case_study]

parksw3 commented 1 year ago

So I took the case study and did my own fits (see ebola_test.R in scripts/tests/). At t=240, we have 7409 samples for real-time and 7475 samples for retro. They have empirical means of 6.2 and 6.22, respectively. So the actual fits should be essentially identical.
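For reference, the check is just the following (a quick sketch, assuming case_study_data_trunc / case_study_data_retro are the t=240 real-time and retrospective data sets used in the fits below):

nrow(case_study_data_trunc) # 7409
nrow(case_study_data_retro) # 7475

mean(case_study_data_trunc$stime - case_study_data_trunc$ptime) # 6.2
mean(case_study_data_retro$stime - case_study_data_retro$ptime) # 6.22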

I tried fitting lognormal distributions using naive_delay


naive_trunc <- naive_delay(
  data = case_study_data_trunc, cores = 4, refresh = 0
)

naive_retro <- naive_delay(
  data = case_study_data_retro, cores = 4, refresh = 0
)

naive_trunc_draws <- extract_lognormal_draws(naive_trunc)
naive_retro_draws <- extract_lognormal_draws(naive_retro)

plot(density(naive_trunc_draws$mean))
lines(density(naive_retro_draws$mean), col=2)

[figure: Rplot]

which gives nearly identical posteriors. So I'm going to guess there's nothing wrong with the models or data, and that maybe something is wrong with the targets pipeline instead, especially the sampling process?

parksw3 commented 1 year ago

So I took a look at sampled observations (also in the same file):


tar_load("sampled_ebola_observations")
sampled_ebola_observations_240 <- sampled_ebola_observations %>%
  filter(scenario=="240 days")

sampled_ebola_observations_240_rt <- sampled_ebola_observations_240 %>%
  filter(obs_type=="real-time")

sampled_ebola_observations_240_ret <- sampled_ebola_observations_240 %>%
  filter(obs_type=="retrospective")

mean(sampled_ebola_observations_240_rt$stime-sampled_ebola_observations_240_rt$ptime) # 5.91
mean(sampled_ebola_observations_240_ret$stime-sampled_ebola_observations_240_ret$ptime) # 6.1

sd(sampled_ebola_observations_240_rt$stime-sampled_ebola_observations_240_rt$ptime) # 2.9
sd(sampled_ebola_observations_240_ret$stime-sampled_ebola_observations_240_ret$ptime) # 4.6

which have similar means but very different standard deviations. For reference, the "true" standard deviation of the full real-time data set at t=240 is 4.17, so 2.9 is pretty far from the truth. So it looks like we might be dealing with a particularly unrepresentative subsample here.

Though I'm not sure why the naive fits look so terrible and so far from the truth (i.e., the full sample), or even from the empirical mean/sd. Just to make sure, I tried doing my own fits and was able to replicate the results (not showing the actual histogram because it's pretty redundant):

naive_retro_sub <- naive_delay(
  data = sampled_ebola_observations_240_ret, cores = 4, refresh = 0
)

naive_retro_sub_draws <- extract_lognormal_draws(naive_retro_sub)

hist(naive_retro_sub_draws$mean)
hist(naive_retro_sub_draws$sd)

So I'm guessing the lognormal distribution is probably just a really bad description of the subsampled data. This is just showing the histogram vs the lognormal fit (based on the posterior median):
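The comparison plot is roughly the following (a sketch rather than the exact figure code, and assuming the extracted draws include meanlog/sdlog columns; otherwise convert from mean/sd first):

delay_sub <- sampled_ebola_observations_240_ret$stime -
  sampled_ebola_observations_240_ret$ptime

# empirical delay distribution vs the lognormal implied by the posterior median
hist(delay_sub, breaks = 30, freq = FALSE)
curve(
  dlnorm(x, meanlog = median(naive_retro_sub_draws$meanlog),
         sdlog = median(naive_retro_sub_draws$sdlog)),
  from = 0, to = max(delay_sub), add = TRUE, col = 2
)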

[figure: Rplot02]

parksw3 commented 1 year ago

Another important aspect we're ignoring is that there is a considerable change in the mean delay over time:

ggplot(case_study_data) +
  geom_smooth(aes(ptime, delay_daily))

[figure: Rplot03]

I feel like we need to show how to deal with this properly, at least to some approximation (because we know that censoring also affects the mean, which we don't really know how to deal with yet), rather than doing fits to subsamples (or do both).

seabbs commented 1 year ago

This is really great Daniel, thanks for taking such a detailed look.

I agree the variation over time is a real-world issue we need to consider/discuss etc but would ideally be something for another paper where we look at this in simulation as well?

We don't really need to subsample the data, do we? Perhaps for the case study we should just fit to the full data set? I don't think this will help massively, but it would better reflect real-world practice. We were time-constrained, but I think we/you have resolved that, so it might also be nice to show speed with real-world samples?

So I'm guessing the lognormal distribution is probably just a really bad description of the subsampled data. This is just showing the histogram vs the lognormal fit (based on the posterior median):

This is also something we have basically just ignored and we should definitely comment on (perhaps we should be showing the posterior predictive plots to indicate if our fit to the data is bad?). Again I sort of feel exploring other distributions (and showing how to do this) might be a bit much for this paper?
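For the posterior predictive plots, if the fitted objects are brms fits under the hood (which I think they are), something like this should be enough (sketch, using the naive_retro fit from above):

library(brms)

# overlay the observed delays against draws from the posterior predictive
pp_check(naive_retro, ndraws = 100)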

I'll take the sampling step out now for the case study and run everything just to see what happens. Will report back.

parksw3 commented 1 year ago

I agree the variation over time is a real-world issue we need to consider/discuss etc but would ideally be something for another paper where we look at this in simulation as well?

This seems fine. I think we can leave it for another paper.

We don't really need to subsample the data, do we?

I don't think so. I don't know when subsampling was added, but I'm guessing it was added because the latent model takes forever to run on the full data set. I tried it last night but gave up. It could be nice to show speed with real-world samples. Or we could do replicates of subsamples if we don't want an individual subsample affecting our conclusions.
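Replicates of subsamples could be as simple as this (just a sketch, assuming dplyr/purrr and a nominal subsample size of 200):

library(dplyr)
library(purrr)

# draw 10 independent subsamples of 200 observations each, tagged by replicate
subsamples <- map_dfr(1:10, function(i) {
  case_study_data_retro %>%
    slice_sample(n = 200) %>%
    mutate(replicate = i)
})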

Again I sort of feel exploring other distributions (and showing how to do this) might be a bit much for this paper?

I agree that it's too much for this paper. I guess my point was that we're looking at a bad subsample, rather than that we should be looking at other distributions.

I'll take the sampling step out now for the case study and run everything just to see what happens. Will report back.

Sounds good. Though I should've asked you to wait a bit more until we resolve other problems...

seabbs commented 1 year ago

I dropped sampling for the case study and increased the maximum sample size for the outbreak simulation to 400 (leaving 200 as an additional sensitivity analysis) (see #35). There are still a few odd signals in the results, but I think most of the issues due to unrepresentative subsampling have been sorted out.

I agree it is nice to show real-world performance with a real-world sample.

I guess my point was that we're looking at a bad subsample, rather than that we should be looking at other distributions.

Yes agree and we want to avoid this.

Sounds good. Though I should've asked you to wait a bit more until we resolve other problems...

It isn't really very painful for me to run this, so no problem. It's clocking in at about 3 hours for the full analysis at the moment and I'm just letting it tick over overnight.

seabbs commented 1 year ago

This is what the case study output is now looking like. Much closer to what I would expect I think.

[figure: case_study]

parksw3 commented 1 year ago

Hm, this still looks off to me. For example, the real-time and retrospective estimates should be pretty different without a truncation correction, but they look nearly identical at t=60. And the truncation method seems to be overshooting way too much. Let me dig in more...

seabbs commented 1 year ago

😓 oh dear you are not wrong. I think everything apart from naive, filtered, and truncated looks semi-sensible?

I looked at the diagnostics today and there don't seem to be any fitting problems for these models (but there are a few for the exponential simulation, which I think explains the issues with the posteriors there).

parksw3 commented 1 year ago

Just getting to this now... I had to work on something else earlier.

First, looking at the data at t=60 (in ebola_test.R).

mean(case_study_data_trunc$stime-case_study_data_trunc$ptime) #5.3 days
mean(case_study_data_retro$stime-case_study_data_retro$ptime) # 5.5 days
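# the corresponding sds discussed just below:
sd(case_study_data_trunc$stime-case_study_data_trunc$ptime) # 2.9
sd(case_study_data_retro$stime-case_study_data_retro$ptime) # 3.6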

So we don't see much epidemic growth before t=60, which explains why there isn't much truncation happening. So we're kind of good. We see slightly bigger differences in the standard deviation (2.9 vs 3.6 days), which are probably driven by super long delays that we don't observe during real-time outbreaks (we're only missing 21 cases).

And fitting the naive model actually gives a mean of ~8 days, as we see in your figure. So there's nothing wrong with the data or our models. It's just that the lognormal distribution is a really, really bad description of the data??? This is the resulting fit:

[figure: Rplot05]

which is kind of crazy because accounting for censoring somehow makes the fit so much better...

parksw3 commented 1 year ago

Ah, OK. So it seems like the zero padding is really messing up the inference here. If we get rid of the zeroes and fit the naive model, we get good fits. The fits on a log scale (with padding) look like this:

[figure: Rplot06]

Essentially, we are adding artificial variance to the data. As a sanity check, I also tried running the truncated model with padding vs dropping zeroes. When we drop the zeroes completely, we get more sensible results: a mean of 5.4 days (5.2--5.6) and an SD of 2.4 days (2.2--2.7). We still underestimate the standard deviation quite a bit because the truncation model just isn't good at capturing super long delays.
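For reference, the drop-zeroes check was along these lines for the naive model (a sketch, assuming zero delays appear as stime == ptime before any padding is applied; the truncated model comparison is analogous):

case_study_data_trunc_nozero <- subset(
  case_study_data_trunc, stime - ptime > 0
)

naive_trunc_nozero <- naive_delay(
  data = case_study_data_trunc_nozero, cores = 4, refresh = 0
)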

parksw3 commented 1 year ago

I wonder if it's really good practice to keep zero padding as the default option. Maybe keep zero padding but spit out a warning saying that zeroes are being padded, which can give biased estimates? We also want to do something about it in our simulations and case studies...
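Something like this on the padding side, purely as a hypothetical sketch (pad_zero_delays is not an existing function in the package):

pad_zero_delays <- function(data, pad = 1e-3) {
  zero <- data$stime - data$ptime == 0
  if (any(zero)) {
    warning(
      sum(zero), " zero delays padded by ", pad,
      "; padded zeroes can give biased estimates"
    )
    data$stime[zero] <- data$stime[zero] + pad
  }
  data
}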

seabbs commented 1 year ago

Ah this is great detective work Daniel!

I think in general all of this is highlighting how important fitting a well-specified distribution is and we need to flag that (and that we are effectively ignoring it in this work).

It's interesting that early on in the case study there is basically no truncation due to the slow growth and the impact is just on the sd. That is definitely not a scenario we cover in the simulations but I can see it happening quite a lot (for example if you have some amount of case importation and then a seeding event for a local outbreak). Probably just something we need to discuss I guess.

Makes sense that the zero padding is causing problems and I agree it does seem like perhaps we can't keep it as the default. Perhaps we should be dropping by default and truncating at 1 to account for that (in models that have truncation)?

I guess it's a shame we picked the lognormal, as this shouldn't be an issue a lot of the time (again, perhaps we can reflect on this?)

seabbs commented 1 year ago

I've got to drop off for a bit now but will circle back to this later.

parksw3 commented 1 year ago

Ah this is great detective work Daniel!

Thanks!

I think in general all of this is highlighting how important fitting a well-specified distribution is and we need to flag that (and that we are effectively ignoring it in this work).

I agree

It's interesting that early on in the case study there is basically no truncation due to the slow growth and the impact is just on the sd.

This is certainly interesting. And I think it's worth discussing too. There was another plot that I thought might be useful... will write about this soon-ish (before we meet).

Perhaps we should be dropping by default

I think dropping seems fine. We can explain in the methods that we tried padding earlier but it gave really biased answers in some cases, and we don't advise doing it in general.

truncating at 1 to account for that (in models that have truncation)?

This is interesting... I guess this works? We also don't have to worry about this in models with censoring, right? So I guess we only have to worry about it in the pure truncation scenario?

I guess it's a shame we picked the lognormal, as this shouldn't be an issue a lot of the time (again, perhaps we can reflect on this?)

Hm, it's not immediately obvious to me how much of an issue this is for other distributions, and we don't want to go down that rabbit hole. I guess the whole point is that there are many practical issues that people should be thinking about, and it's important that we point them out.

seabbs commented 1 year ago

After changing to dropping zeroes where we aren't censoring, and to left truncation at 1 where censoring is happening and it's possible, things are looking much better. Note that the censored observation time (used for truncation) has also been changed here, from the upper bound of the primary event to the lower bound.

[figure: case_study]
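For reference, the left truncation at 1 is just the standard brms construction, roughly (a sketch only, not the actual model code, which also handles censoring and right truncation):

library(brms)

# lognormal delay fit, left-truncated at 1 day, after dropping shorter delays
fit <- brm(
  formula = delay_daily | trunc(lb = 1) ~ 1,
  family = lognormal(),
  data = subset(case_study_data_trunc, delay_daily >= 1),
  cores = 4, refresh = 0
)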

seabbs commented 1 year ago

This all looks good, I think, so closing out.