rtcovidlive / rtlive-global

Globalized version of the Rt-live model.

[Question] Exposure variable #15

Open · andrybicio opened 3 years ago

andrybicio commented 3 years ago

Can you please explain (maybe with some refs) how the exposure variable works?

exposure = pymc3.Deterministic(
    "exposure",
    pymc3.math.clip(tests, observed.daily_tests.max() * 0.1, 1e9),
    dims=["date_with_testcounts"]
)

I see that we are defining a variable "exposure", which is clipped to lie between 10 % of the maximum daily test count and a huge upper bound.
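For illustration, here is a toy version of what that clip does (made-up numbers, numpy instead of pymc3.math, which has the same clip semantics):

import numpy as np

daily_tests = np.array([12000, 300, 0, 15000])  # hypothetical test counts
lower = daily_tests.max() * 0.1                 # 10 % of the maximum = 1500
print(np.clip(daily_tests, lower, 1e9))         # [12000. 1500. 1500. 15000.]

So days with very few (or zero) reported tests are floored at 10 % of the busiest day.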

Later, this variable is used as follows, but only for the dates where both case and test counts are available (dates present in mask_exposure):

exposure = idata.posterior.exposure[:, :, mask_exposure].rename({coord_exposure: "date_with_data"})
exposure_profile = exposure / idata.constant_data.exposure.max()

I do not understand which values the "exposure" variable can take, nor how it is computed each day. I understand it is similar to the "exposure" term in Poisson regression models, but in doing so aren't we fixing a priori some sort of "rate of positiveness"? I mean: if we notice that the number of processed tests is much lower during weekends, and we therefore predict a larger number of tests and consequently a larger number of infected people according to this exposure, aren't we overestimating the total number of infected for the very next days? Indeed, fewer tests are processed during weekends, but those same tests are processed at the beginning of the following week and will appear in the official data. Aren't we introducing some sort of double counting in doing so?

michaelosthege commented 3 years ago

@andrybicio your concern is valid. The clipping in the model is just done as a numerical safeguard. However, the lower bound of 10 % of the maximum could be outdated, as many countries have significantly increased their testing volume since that line was written.

The weekend/weekday exposure effect is not so much in the model itself, but more in get_scaling_factor: https://github.com/rtcovidlive/rtlive-global/blob/1d9bfb5bdfae25be8e615194f0051b7912215135/rtlive/model.py#L253-L255

For most countries the overall fluctuation of testing volume has a larger amplitude than the within-week fluctuation, so I don't expect this overestimation to be very pronounced. Nevertheless I would be happy to discuss alternatives to exposure_profile = exposure / idata.constant_data.exposure.max(). Maybe using max(weekly_averages) instead of max(daily)?
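A minimal sketch of that alternative (illustrative only; daily_tests stands in for the model's test-count series and is not a name from the repository):

import pandas as pd

# Hypothetical daily test counts indexed by date.
daily_tests = pd.Series(
    [5000, 5200, 4800, 1500, 1200, 5100, 5300, 5400],
    index=pd.date_range("2021-01-04", periods=8),
)

# Non-overlapping weekly means; normalize by the largest weekly mean
# instead of the single largest day.
weekly_means = daily_tests.resample("W").mean()
exposure_profile = daily_tests / weekly_means.max()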

andrybicio commented 3 years ago

From an epidemiological point of view, given the timescales we know for COVID, strong variations should not occur from day to day. Therefore I think averaging over a week could be a proper way to compute a consistent and realistic rate of positiveness. Indeed, that rate often changes over the course of a wave (peaking when the epidemic is spreading exponentially), but without large jumps.

One more question about how it works: once exposure has been initialized, the rate of positiveness is computed for the days where both test counts and positive results are available. Then, essentially, the maximum value in "exposure_profile" (which, if I understood correctly, is an array containing the daily rates of positiveness?) is used as the "exposure factor" that multiplies total_positive and in turn returns total_inferred?

If so, I would consider weekly averages as a solution (or a rolling average over a 7-day window), since assuming that the minimum rate is at least 10 % is a really strong assumption (it indeed leads to an overestimation), and the rate is not constant throughout an epidemic.

michaelosthege commented 3 years ago

No, the code doesn't look at the fraction of positives (positive rate), but just at the total number of tests. exposure_profile is just a vector that is proportional to the total number of tests per day.

Feel free to open a PR for the week-based exposure. But keep in mind that there are sometimes gaps in the data.
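For instance, a rolling variant could tolerate short gaps (a sketch with illustrative names, not the repository's API):

import pandas as pd

# daily_tests: a pd.Series of test counts indexed by date, possibly with NaNs.
# Centered 7-day rolling mean; min_periods=4 still yields a value when up
# to three days in the window are missing, instead of producing NaN.
smoothed = daily_tests.rolling(window=7, center=True, min_periods=4).mean()
exposure_profile = daily_tests / smoothed.max()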

michaelosthege commented 3 years ago

It might be easier if you print/plot the variables yourself. This is the trace for "DE / all" a few days ago: 2021-02-12 DE_all.zip

In a Jupyter notebook just load and display it with ArviZ:

import arviz

idata = arviz.from_netcdf("trace.nc")
idata

The trace object contains all the information that goes into the calculation of the scaling factor. The helper function pymc3.gp.util.plot_gp_dist is ideal for plotting the densities:

import pymc3
from matplotlib import pyplot

# Helper variable that has all chains stacked together.
posterior = idata.posterior.stack(sample=("chain", "draw"))

fig, ax = pyplot.subplots()

# Each row of `samples` is one posterior draw of the exposure vector.
pymc3.gp.util.plot_gp_dist(
    ax=ax,
    x=posterior.exposure.date_with_testcounts.values,
    samples=posterior.exposure.values.T,
)

pyplot.show()

You can then copy the intermediate variables from the scaling factor calculation and plot them one after another.
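For example, one such intermediate (a sketch reusing the names from the snippets above; the exact internals of get_scaling_factor may differ):

# Recompute the normalized exposure profile from the trace and plot its
# posterior mean over time.
exposure_profile = posterior.exposure / idata.constant_data.exposure.max()

fig, ax = pyplot.subplots()
ax.plot(
    exposure_profile.date_with_testcounts.values,
    exposure_profile.mean("sample").values,
)
ax.set_ylabel("exposure / max(exposure)")
pyplot.show()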