idem-lab / epiwave.pipelines


flexibility in estimate_delays #4

Open smwindecker opened 11 months ago

smwindecker commented 11 months ago

Currently the function estimates delays directly from the data. We should have the flexibility to use the linelist for certain dates/states, while specifying other dates/states for which we instead use a national average, a disease literature average, or some other source.

We should not make it too easy to default to using bad data.
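One way this could look, as a rough sketch only: overrides are declared explicitly per state and date range, and the linelist is used only when enough paired dates exist, otherwise the call fails loudly instead of quietly falling back. The function name `delay_source`, the override tuple layout, the source labels, and the `min_pairs` threshold are all hypothetical and not part of the package.

```python
from datetime import date

def delay_source(state, day, overrides, n_pairs, min_pairs=30):
    """Return the name of the delay source to use for (state, day).

    overrides: list of (state, start_date, end_date, source_name) tuples
               declared explicitly by the analyst (hypothetical layout).
    n_pairs:   number of paired-date records available for this state/day.
    """
    for o_state, start, end, source in overrides:
        if o_state == state and start <= day <= end:
            return source
    # No override applies: only use the linelist if it has enough data,
    # and refuse to guess a fallback otherwise.
    if n_pairs < min_pairs:
        raise ValueError(
            f"only {n_pairs} paired dates for {state} on {day}; "
            "declare an explicit override for this period"
        )
    return "linelist"

overrides = [("NSW", date(2021, 6, 1), date(2021, 8, 31), "national_average")]
print(delay_source("NSW", date(2021, 7, 1), overrides, n_pairs=5))
print(delay_source("VIC", date(2021, 7, 1), overrides, n_pairs=100))
```

Raising an error when the linelist is too sparse keeps the choice of fallback in the analyst's hands.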

smwindecker commented 11 months ago

After further discussion: implement multiple imputation for this task instead.

AugustHao commented 8 months ago

We need to estimate time-varying delays from paired-date data. The current approach is to construct a rolling window over the paired-date delay data and then compute the CDF of the delays within each window. This is computationally expensive, so a long-term goal is to find a better way to implement it, but we have something that works in the meantime.
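For illustration, the rolling-window idea can be sketched as below. This is not the package's implementation; the function name `rolling_delay_cdf`, the symmetric window, and the `half_window` and `max_delay` parameters are assumptions made for the example.

```python
from datetime import date, timedelta

def rolling_delay_cdf(pairs, target, half_window=7, max_delay=10):
    """Empirical CDF of delays (in days) around a target date.

    pairs: list of (first_date, second_date) tuples.
    Uses pairs whose first date lies within `half_window` days of `target`.
    Returns a list where entry k is P(delay <= k) for k = 0..max_delay,
    or None if no pairs fall inside the window.
    """
    lo = target - timedelta(days=half_window)
    hi = target + timedelta(days=half_window)
    delays = [(second - first).days
              for first, second in pairs
              if lo <= first <= hi]
    if not delays:
        return None  # a gap: no paired dates observed near this day
    n = len(delays)
    return [sum(d <= k for d in delays) / n for k in range(max_delay + 1)]

pairs = [
    (date(2022, 1, 1), date(2022, 1, 3)),
    (date(2022, 1, 2), date(2022, 1, 3)),
    (date(2022, 1, 5), date(2022, 1, 9)),
    (date(2022, 2, 1), date(2022, 2, 2)),
]
print(rolling_delay_cdf(pairs, date(2022, 1, 3), half_window=3, max_delay=5))
```

Each target date rescans every pair, so the cost grows with (number of days) x (number of pairs), which is where the expense comes from in a naive version like this one.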

Key points to consider:

- The goal is to estimate delays over a continuous time period, but paired-date data does not necessarily cover all of the dates in this period, i.e. there are gaps in the time series where we do not observe paired delays because one of the dates is missing. This means that we necessarily have to interpolate the delay distribution across some date ranges.
- If we can define a parametric form of the delay distribution, with the distribution parameters as time-varying variables, we can learn them from data using a modelling approach. But this relies on very strong assumptions about the shape of the delay distributions, which is undesirable.
- There may be a way to mix parametric and non-parametric densities in an informative way?
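As a minimal sketch of the interpolation point, gaps could be filled by linearly interpolating between the nearest days that do have an empirical CDF and holding the nearest estimate constant at the edges. Linear interpolation is just one assumed choice here, not something settled in this thread, and `interpolate_cdfs` is a hypothetical helper. A weighted average of two CDFs is itself a valid CDF (a mixture of the two distributions).

```python
def interpolate_cdfs(cdf_by_day, n_days):
    """Fill gaps in a time series of delay CDFs.

    cdf_by_day: dict mapping day index -> CDF (list of P(delay <= k)),
                containing only the days with observed paired dates.
    Returns a list of CDFs for days 0..n_days-1.
    """
    known = sorted(cdf_by_day)
    if not known:
        raise ValueError("need at least one day with an observed CDF")
    out = []
    for t in range(n_days):
        if t in cdf_by_day:
            out.append(list(cdf_by_day[t]))
            continue
        before = [k for k in known if k < t]
        after = [k for k in known if k > t]
        if not before:            # before the first observed day
            out.append(list(cdf_by_day[after[0]]))
        elif not after:           # after the last observed day
            out.append(list(cdf_by_day[before[-1]]))
        else:                     # inside a gap: weight by distance
            a, b = before[-1], after[0]
            w = (t - a) / (b - a)
            out.append([(1 - w) * x + w * y
                        for x, y in zip(cdf_by_day[a], cdf_by_day[b])])
    return out

known = {0: [0.0, 0.5, 1.0], 4: [0.0, 1.0, 1.0]}
filled = interpolate_cdfs(known, 6)
print(filled[2])  # halfway between day 0 and day 4
```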

AugustHao commented 8 months ago

We need a way to filter out recent days from the calculation of delays.

See Appendix A of this paper for a similar approach/justification: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05428-4#appendices

In summary: not all recent infections have been observed yet in the latest reported cases, so those that have been observed have shorter delays than average. If we compute time-varying delays from these observations, we will erroneously underestimate the delay for the most recent time period. We should therefore ignore delay information from the most recent days and hold the delay distribution constant from about one maximum delay range before the present, as they have done in the paper.
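The clamping step might look something like the sketch below, assuming the time-varying delays are already available as one CDF per day with the last entry being the present. The name `clamp_recent` and the list-of-lists layout are assumptions for the example, not the package's API.

```python
def clamp_recent(cdfs, max_delay):
    """Hold the delay CDF constant over the most recent `max_delay` days.

    cdfs: list of per-day CDFs, ordered in time, last entry = present.
    Days closer than `max_delay` to the present are right-truncated
    (only the short delays have been observed so far), so their CDFs are
    replaced by the CDF of the last day outside that range.
    """
    n = len(cdfs)
    cutoff = n - 1 - max_delay  # last day whose longest delays are observable
    if cutoff < 0:
        raise ValueError("time series is shorter than the maximum delay")
    return [list(cdfs[min(t, cutoff)]) for t in range(n)]

cdfs = [[0.1, 1.0], [0.2, 1.0], [0.3, 1.0], [0.9, 1.0], [1.0, 1.0]]
print(clamp_recent(cdfs, max_delay=2))
```

In this toy series the last two days look artificially fast (0.9 and 1.0 of cases resolved at delay 0), and clamping replaces them with the day-2 estimate.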