seabbs commented 2 years ago

Applications

Serial interval and incubation period estimation for Monkeypox (also used as part of CDC reporting). Censoring adjustment but not right truncation adjusted. Based on EpiEstim which itself uses courseDataTools (https://github.com/nickreich/coarseDataTools). https://www.medrxiv.org/content/10.1101/2022.10.26.22281516v1.full.pdf
courseDataTools has a range of linked citations with I think the methods coming from here: https://doi.org/10.1002/sim.3659. From reading it makes use of two doubly censored approaches one of which is a reduction of the other. Its frequentist and I think corresponds to our simply censoring approach of the latent approach without truncation (assuming uniform priors). They simulate using a uniform prior for the day of the primary event, a log normal for the distribution and then censor the secondary event (so no phase bias issues in their simulation). They explored diurnal (waking day) biased priors and found minimal impact. They also investigated a spiked prior and found this had more impact.

In 2003, the World Health Organization recommended that 200 observations would be needed to estimate the incubation period distribution[23]; however, that recommendation was not based ona rigorous statistical analysis that accounted for coarse data arriving in a real-time epidemic. Our methods can be used to investigate the adequacy of the WHO sample size recommendations. A focused statistical investigation could provide evidence-based sample size guidelines for estimating both the center and the tails of the distribution under different levels of coarse data.

These results suggest that if prior knowledge indicates that the exposure may not be uniform,then that information should be incorporated into the analytic techniques

Something worth adding to our discussion is this can be done trivially for our approach either via brms or using stan directly which is nice.

This could be a useful way to frame our exploration of sample size.

For the single censored model courseDataTools uses survival but its own code for the doubly censored model (which assumes uniform censoring).

Screenshot 2022-11-01 at 12 34 05

courseDataTools received a lot of recent usage for reference.

Using statistics and mathematical modelling to understand infectious disease outbreaks: COVID-19 as an example: https://doi.org/10.1016/j.idm.2020.06.008

Theory

Left truncation + censorings vs naive methods: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7770078/
Understanding an evolving pandemic: An analysis of the clinical time delay distributions of COVID-19 in the United Kingdom: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0257978.
Estimating a time-to-event distribution from right-truncated data in an epidemic: A review of methods: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9465556/

seabbs commented 2 years ago

Estimating a time-to-event distribution from right-truncated data in an epidemic: A review of methods

This is a nice review in the context of Covid. Gives a pretty nice explanation of the problem for a lay audience which is worth a read for how we may want to communicate.
It focuses on methods' ability to estimate the asymptotic relative efficiency (ARE).
Focusses quite a bit on identifiability issues with non-parametric methods.
Methods with joint estimation of primary events and the delay distribution (i.e nowcasting like).

Another example of the use of 𝐿2 is given by Cox and Medley, 11 who estimate the distribution of the time 𝑇 taken for an AIDS diagnosis to be reported to the Communicable Disease Surveillance Centre. They allow the rate of AIDS diagnoses to be increasing sub-exponentially, by using ℎ(𝑡;𝜆)=𝜆0exp(𝜆1𝑡+𝜆2𝑡2) , and test the null hypothesis that 𝜆2=0 . They consider several parametric models for the distribution of the reporting delay 𝑇 .

This is a really nice early nowcasting paper that implements a model very similar to that in epinowcast

They describe what we have been looking at as conditional-on-initial (which obviously makes sense)

𝐿3 does not require a model 𝑓𝑋(𝑥;𝜆) to be specified for the distribution of the initial event times. This eliminates the risk that such a model may be misspecified. However, it has the disadvantage that some of the information in the data is being discarded, which makes 𝐿3 less efficient than 𝐿1 , especially when 𝜏 is small.

Something we could consider is bringing in the epinowcast model to this comparison. I would rather not but it should be the L2 model.
In general this is written for quite a high level audience.
Censoring is not really adjusted in detail/fully.
They explore a range of growth rate conditions and gamma distributions though with a fixed mean of 19 days (based on early pandemic estimates).
The evaluation component is really only focussed on efficiency.
They estimate everything using flexsurv. It isn't clear to me how this handles censoring when combined with truncation and I think perhaps it is being ignored.

In addition to being right-truncated, 𝑌 may be censored. This is easily handled in parametric models by replacing 𝑓∗𝑇(𝑡𝑖) in 𝐿1 and 𝐿3 by 𝐹∗𝑇(𝑡𝑈𝑖)−𝐹∗𝑇(𝑡𝐿𝑖) , where [𝑡𝐿𝑖,𝑡𝑈𝑖] is the interval within which individual 𝑖 ’s delay is known to lie.

Censoring comment in the discussion. Doesn't address what to do about truncation.

seabbs commented 2 years ago

Using statistics and mathematical modelling to understand infectious disease outbreaks: COVID-19 as an example

Has a nice discussion of some of the general challenges faced when estimating delays in real-time. Also uses early Wuhan data to indicate the impact of truncation (using the public linelist).
Explores the impact of censoring with windows wider than 1 day per event but doesn't make an adjustment.

sbfnk commented 2 years ago

As a specific example there is also Estimates of the severity of coronavirus disease 2019: a model-based analysis which uses growth rate adjustment of naive delays (similar to what is used in Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data).

Perhaps it's even worth adding the growth rate adjustment described in Section 2.1 of the supplement (with code) to the scenarios investigated?

seabbs commented 2 years ago

@parksw3 had originally had dynamic adjustment much more front and centre in this work. We have pushed it back a bit as the simulations have got more complex as its a bit hard to implement when growth rates are varying and 2 if they are varying how do you estimate them without a joint model (i.e meaning the post-processing approach is perhaps not ideal). The easiest thing to do is treat them as known but then does that really help people who want to implement these methods.

I think the plan is definitely to keep it in (at least in some form). I am currently investigating a simplified form of the forward correction (i.e having the growth rate jointly estimated in model) which should be a bit easier to compare to other approaches (maybe).

OvertonC commented 2 years ago

Using statistics and mathematical modelling to understand infectious disease outbreaks: COVID-19 as an example

Explores the impact of censoring with windows wider than 1 day per event but doesn't make an adjustment. If I remember correctly I think I assumed exponential growth across the interval censored period and then added the exponential to the likelihood and integrated over this. Similar to other approaches, but using "L3" rather than "L2" using Shaun's definitions, which I think is the one I've seen more commonly applied. However, I would now prefer a latent variable approach, as this is not vectorisable and had terrible computational scaling (which is also why I prefer L3 as L2 isn't vectorisable when accounting for interval censoring and right truncation even when using the latest variable approach). But this approach still requires knowing the growth rate in advance, and for long intervals the results are very sensitive to the growth rates, for the Wuhan data it could push the modelled mean from 4-7 days by moving the growth rate over a small range.

seabbs commented 2 years ago

Screenshot 2022-11-03 at 10 50 41

Ah you are right. Isn't this L1 according to Shaun's framework where the growth rate is known? L3 is explicitly not joint modelling. Do you have the likelihood for the complete model you used here written down somewhere? Or for that matter the code? Just had a brief browse and can only see code for the other bits of the paper.

I totally agree that needing to know the growth rate/assume it is fixed is a limitation I am not really willing to accept.

OvertonC commented 2 years ago

I believe this is L3, which I think of as "forward looking", but the essential meaning of "conditional on initial" is the same. This involves conditioning on time of the first event and looking at the distribution of secondary event times. If the first event time is known exactly, all the joint modelling parts cancel out, which is why you don't have any joint modelling. However, if you have interval censoring on the first event time, they no longer cancel out since the g(i) terms fall inside the integrals. Which is why we use the latent variable approach in the Bayesian model, so that the event times are sampled, and then the integrals over "i" disappear and the g(i) terms cancel out, so no joint modelling is required.

seabbs commented 2 years ago

I think as written this screenshot is a touch unclear (but this really doesn't matter) and more in line with the joint approach (i.e L1). I agree if you condition on primary events (and therefore don't model their uncertainty etc) you can call g terms and rewrite the likelihood as done in L3.

I agree it can be dropped without censoring or when censoring is otherwise handled. Though as we discussed for longer censorings windows that is no longer trivial.

seabbs commented 2 years ago

Suggestion from @sbfnk to look at Ebola Virus Disease in West Africa — The First 9 Months of the Epidemic and Forward Projections (supplement).

They do a lot of distribution estimation aiming to correct for left truncation (they call this censoring but I think it sounds like it isn't (as they apply the correction to all data and not just the censored observations) and daily censoring. They do a daily censoring adjustment by just shifting all the data by half a day. This seems like it should add some bias but be better than doing nothing. I don't want to add more work but perhaps we do need to investigate this as commonly used? I am not totally clear why they have left truncation and it seems like right truncation would be a much much bigger deal in their data given the state of the outbreak when this was published. Perhaps this is a mistake in the equations?

I guess this approach makes sense if filtering out recent observations based on delay length but as written this would apply to all short delays (including those far in the past) which seems incorrect.

You may want to take a look at their section discussing generation time estimation @parksw3 for other work if you haven't already...

I see nothing in the papers citing this that indicates any mistakes have been flagged but lots and lots of reuse of these distribution estimates for quite "high impact" work so if we do agree there are issues it's a good thing to discuss heavily.

Screenshot 2022-11-14 at 10 49 05

Screenshot 2022-11-14 at 10 35 15

seabbs commented 1 year ago

Example estimating the incubation period of Monkeypox with some mention of censoring but none of truncation: https://www.eurosurveillance.org/content/10.2807/1560-7917.ES.2022.27.24.2200448

Cites this COVID paper for its method details: https://www.eurosurveillance.org/content/10.2807/1560-7917.ES.2020.25.5.2000062#html_fulltext

Method details are not in the supplement (its just the main text so very sparse). They do a censoring correction for unknown exposure times but no daily censoring adjustment and no truncation adjustment (see stan code below). They published data and so in theory this is something we could look at as a real-world case study if we so wished (not sure we need to or should)

data{
  int <lower = 1> N;
  vector[N] tStartExposure;
  vector[N] tEndExposure;
  vector[N] tSymptomOnset;
}

parameters{
  real<lower = 0> alphaInc;     // Shape parameter of weibull distributed incubation period
  real<lower = 0> sigmaInc;     // Scale parameter of weibull distributed incubation period
  vector<lower = 0, upper = 1>[N] uE;   // Uniform value for sampling between start and end exposure
}

transformed parameters{
  vector[N] tE;     // infection moment
  tE = tStartExposure + uE .* (tEndExposure - tStartExposure);
}

model{
  // Contribution to likelihood of incubation period
  target += weibull_lpdf(tSymptomOnset -  tE  | alphaInc, sigmaInc);
}

generated quantities {
  // likelihood for calculation of looIC
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = weibull_lpdf(tSymptomOnset[i] -  tE[i]  | alphaInc, sigmaInc);
  }
}

seabbs commented 1 year ago

Lauer paper which we made a lot of use of early on (and late on for that matter) as the principle incubation period estimate: https://www.acpjournals.org/doi/10.7326/M20-0504

They used: "using a previously described parametric accelerated failure time model (13)" which reminds me we do need to make the point clearly that this estimation task is best thought about as a time to event (i.e survival problem) and therefore use methods (like we do) from that silo.

The actual implementation they used was: coarseDataTools and activemonitr - first one is some kind of fairly reasonable censoring (but not truncation) adjusted method and no idea about the second one. I wouldn't have described that method as a parametric accelerated failure time model but perhaps it is or perhaps they used something else for the actual estimation?

Code: https://github.com/HopkinsIDD/ncov_incubation

Yup they just use courseDataTools so no truncation adjustment but they are accounting for double censoring in a way that I think is sensible (at least will need to dig more into Reich et al. to work out if it isn't)

seabbs commented 1 year ago

In this (https://onlinelibrary.wiley.com/doi/full/10.1111/j.1541-0420.2011.01709.x?saml_referrer) work by Reich et al. they deal with the truncation issue using an EM maximisation approach (that seems fine) for CFR estimation. They don't do anything about the delay they are actually using being truncated and it appears in general there is no functionality in coarseDataTools to do this.

seabbs commented 1 year ago

We haven't really discused where this paper fits which we maybe should: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0257978.

seabbs commented 1 year ago

Growth rate correction being used in the wild: https://www.mdpi.com/2077-0383/9/2/538

OvertonC commented 1 year ago

We haven't really discused where this paper fits which we maybe should: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0257978.

The addition on the incubation period is nice. We've currently added this to our forward looking (L3) approach instead of the joint approach (L2 I think?). The joint approach was horribly slow since evaluating the integral on the denominator is a pain in stan. An issue with the incubation period approach is that is isn't quite correct, as there is an uncorrected epidemic phase bias in there. I think it is possible to correct this, but you need to add a backcalculation of infection incidence, which we've not tried to implement. If the delay of interest is much longer than the incubation period (e.g. if looking at time to death) then the missed epidemic phase bias is hopefully negligible. But for e.g. time from onset-to-testing, the incubation period is likely to be longer, so the magnitude of the missed epidemic phase bias is probably larger than the epidemic phase bias we're putting lots of effort in to correct for the onset-to-testing delay.

seabbs commented 1 year ago

I looked up fistdistr and that supports censored fitting but nothing else. It also provides no guard rails so literally you just specify left and right censoring per data point. I've seen a lot of mistakes being made with this for daily data in the wild.

seabbs commented 1 year ago

n issue with the incubation period approach is that is isn't quite correct, as there is an uncorrected epidemic phase bias in there. I think it is possible to correct this, but you need to add a backcalculation of infection incidence, which we've not tried to implement. If the delay of interest is much longer than the incubation period (e.g. if looking at time to death) then the missed epidemic phase bias is hopefully negligible.

Sounds like you were thinking along the same lines as @parksw3 and I! Looking forward to seeing your work on this.

Also I guess this ends up being similar to using a latent delay in an epinowcast style approach (Poisson version of L1?)? That would be interesting to explore. I suppose the big advantage is much easier support for non daily censoring windows? Edit: Is this true or am I dreaming?

seabbs commented 1 year ago

via @sbfnk: "Estimating the serial intervals of SARS-CoV-2 Omicron BA.4, BA.5, and BA.2.12.1 variants in Hong Kong" (https://onlinelibrary.wiley.com/doi/pdf/10.1111/irv.13105)

It uses the fixed growth rate truncation adjustment approach but with sensitivity analysis on the growth rate (I think a method that uses a prior here would help people if we feel like supplying it). It also appears to additionally do right truncation adjustment on top of this so is a nice example of this issue for the introduction

parksw3 commented 1 year ago

"Estimating the serial intervals of SARS-CoV-2 Omicron BA.4, BA.5, and BA.2.12.1 variants in Hong Kong"

I saw this paper too earlier and thought I already added the paper, but turns out I didn't... oops... this paper also made me wonder whether we need to show somewhere in the SI that accounting for both truncation and growth rate approach is bad.

seabbs commented 1 year ago

yeah I agree but perhaps we can hold off on that whilst we knock everything else into shape.

parksw3 commented 1 year ago

good point. Also agree with that.

seabbs commented 1 year ago

Suggested by Shaun Seamen this paper may be useful for discussing approaches for different censoring assumptions: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.2697

seabbs commented 1 year ago

A very new study to discuss: https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(23)00005-8/fulltext

Seems to be approaching things from a fairly odd angle but has all the same issues from what I can see

parksw3 / epidist-paper

Examples of delay estimation from the literature. #19

Applications

Theory