grantbrown / libspatialSEIR

A C++ and OpenCL framework for fast Bayesian spatio-temporal compartmental epidemic modeling.

Ebola 2014 Discussion #1

Open grantbrown opened 9 years ago

Alfremath commented 9 years ago

Hi, Comment: Could you please add a change log? Question: On the conclusions do you mean June 27th or actually July 27th?

grantbrown commented 9 years ago

Hi, thanks for the feedback.

As for a change log, everything that was changed between versions is logged in source control. Git might not be familiar to everyone, so perhaps I should include a link at the top.

The most recent diff record is here:

https://github.com/grantbrown/libspatialSEIR/commit/e84d8c12a770ef20506e108297e270a2fa368628

There's a lot of info there, but if you scroll past all the uninteresting bits (javascript/JSON etc) it catalogs the changes in text and graphs. Perhaps the best way to see the changes over time is to clone the gh-pages branch of the repository, then you can actually walk through all the past versions in a browser via "git checkout ____" for the appropriate revision number.

You're correct that it should say "July 27", I'll get that fixed.

Edit: I've also added a section at the top linking to an archive of previous models.

devosr commented 9 years ago

Hi. First, thanks for your contribution to the science in this event.

Now that news is out about case zero (http://www.nytimes.com/2014/08/10/world/africa/tracing-ebolas-breakout-to-an-african-2-year-old.html?_r=0), could you include this data point in the analysis?

Observation 1: It is clear that the maximum potential R0 is much larger than the ending (current) R0. Unchecked, the maximum simple reproduction rate can be estimated from this point and the initial slope of the curve. As with any trend of this type, it is instructive to plot the data on a log scale to understand the impact of events on R0 (see the sketch after these observations).

Observation 2: The containment efforts at the initial stages of the event are clear in the data, impacting the R0 and bringing it to a controlled outbreak, as your plot shows.

Observation 3: The recent loss of control and its impact on R0 may be influenced by the smoothing techniques. In my simplistic analysis, the instantaneous rate over the last two months shows a growth rate approximately one half that of the initial outbreak, possibly due to increased public awareness and rudimentary containment efforts.
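Purely for illustration of the log-scale view suggested in Observation 1, a minimal R sketch; the data frame and numbers below are made up, not the outbreak data:

```r
# Minimal sketch: cumulative case counts on a log scale; toy data, not the analysis dataset.
library(ggplot2)

cases <- data.frame(
  date       = as.Date("2014-03-25") + seq(0, 140, by = 7),
  cumulative = round(90 * exp(0.03 * seq(0, 140, by = 7)))   # toy exponential series
)

ggplot(cases, aes(x = date, y = cumulative)) +
  geom_point() +
  geom_line() +
  scale_y_log10() +   # near-straight segments on this scale indicate roughly constant growth rates
  labs(x = "Date", y = "Cumulative cases (log scale)")
```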

Again, thank you for applying your time and effort to this serious event.

If you provide a personal email, I will share my analysis. Rick

grantbrown commented 9 years ago

Hi Rick,

There are a number of data improvements I'd like to make, given time. In addition to the new work on patient zero, Baize et al. in NEJM describe some of the early cases. Also, I'd like to work through the actual WHO case reports to build a record of their estimates of new counts rather than relying on the "uncumulate" function given in the code. I may not have time to get to these for a while, as there's a lot of work to be done on the core library (but I will try).
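For readers unfamiliar with the "uncumulate" step, a rough R sketch of the idea follows; this only illustrates differencing cumulative counts and is not the function actually used in the analysis code:

```r
# Illustrative only: recover per-report new case counts from a cumulative series.
# The repository's actual "uncumulate" function may differ.
uncumulate_sketch <- function(cumulative) {
  new_counts <- diff(c(0, cumulative))
  # Reporting corrections can make cumulative totals dip; floor at zero rather
  # than report negative new cases.
  pmax(new_counts, 0)
}

uncumulate_sketch(c(86, 103, 112, 112, 122))
# [1] 86 17  9  0 10
```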

As for R0, I will take a look at adding a log-scale plot. I'd be careful reading too much into the maximal value, especially before more work has been done to compare independent estimates of the infectious population size over time to the estimates in the model. It certainly is an interesting trend, however.

I definitely agree that the smoothing techniques used to drive the intensity process may be impacting the R0 estimates. I haven't had a chance to do much model selection with respect to basis complexity, or yet tried to incorporate any external information which could impact the intensity process (weather etc). This is one of my biggest priorities for future analysis.

I'd be very interested to take a look at your analysis - my email is grant.brown73@gmail.com -Grant

devosr commented 9 years ago

Hello Grant, thanks for the quick response. As you noted on the website, a spline or polynomial curve fit is limited in its ability to extrapolate beyond the data whenever you model a physical phenomenon whose rate of growth depends on its current size. Modeling the data in logarithmic space provides much improved accuracy for extrapolation. I am running a Google Docs spreadsheet with a simple log growth analysis that I have shared on a separate link. It is a work in progress (two days old), so please excuse the formatting. To improve accuracy, I will need to model the clusters separately, since each has a different level of control and therefore a different reproductive rate. The clusters can then be added in linear space.

Also, it would be helpful to segment the populations by period. Early growth was unconstrained and deserves a separate analysis, as does the growth within time segments of each cluster.

R is a traditional language that results in inflexibility when doing analysis of trends. It's important to develop the equation set in another tool and then port the methodology to R.

Another limitation of the analysis is the set of assumptions inherent in least squares. A maximum likelihood method will be important, since it avoids naive assumptions about the distribution of the cluster and provides a more flexible fit as a result. I may try using Weibull MLE tools to better understand the rate equation.

The current expansion of this disease is out of control based on the data and is extremely dangerous. It is critical that we overwhelm the current cases with containment in order to prevent a pandemic. I'm afraid that Nigeria will be the next to take casualties.

Early accuracy in the modeling must provide the decision makers with an accurate sense of urgency.

Regards, Rick


grantbrown commented 9 years ago

Hi Rick, thanks for the feedback. I'll take a closer look at the analysis you sent tomorrow afternoon. I just want to make a quick point of clarification: I'm not actually fitting polynomial or spline based models directly to the observed data/trend, but rather using them as the basis for the "intensity process". The epidemic intensity parameters are only part of the picture, while much of the estimated disease dynamics arise from the stochastic SEIR model. One of the benefits of this separation is the ability to allow different intensities in each spatial location overall and over time. An interesting option for future analyses will be to quantify the "size" of the various interventions (dollars? personnel per capita?) and use these covariates to improve the intensity basis.
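To make the "intensity process" idea concrete, here is a generic sketch of a temporal spline basis feeding a log-linear intensity. This is ordinary R with made-up coefficients and time points, not libspatialSEIR's API or its actual basis construction:

```r
# Generic illustration of a temporal spline basis driving an intensity term;
# the coefficients and time grid are made up, and this is not libspatialSEIR code.
library(splines)

t_idx <- 1:30                       # hypothetical weekly time points
Z     <- bs(t_idx, df = 4)          # B-spline basis with 4 degrees of freedom
beta  <- c(-1.0, 0.6, 0.2, -0.4)    # made-up coefficients (estimated in a real model)

eta       <- drop(Z %*% beta)       # linear predictor over time
intensity <- exp(eta)               # positive intensity via a log link (an assumption here)
round(intensity, 3)
```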

devosr commented 9 years ago

Ahh, this makes more sense, thanks for the clarification. I'll look more closely at the code for the "intensity process".

Regards, Rick


grantbrown commented 9 years ago

Code-wise, most of the model details are located in the underlying C++ library. This document (https://github.com/grantbrown/libspatialSEIR/blob/master/doc/models/Ebola2014Analysis.pdf) is a bit more explicit than the main write-up. I'm happy to answer any additional questions you have about it.

devosr commented 9 years ago

OK, many thoughts come to mind. The slope of the curve on a log scale is greatly affected by the intensity factor, so the intensity factor should be estimated for each cluster and time period.

The dataset should have the following columns to start the correlation:

a. date
b. number of days in this data set
c. cluster number
d. intensity factor
e. log number of infected
f. first derivative of the infected (number of new cases / number of days)
g. log number of deaths
h. first derivative of the deaths (number of new deaths / number of days)

A principal value decomposition can provide an indication of the relationship weight between the factors and the values; optionally, you can calculate confidence bands.

It should all tie together in log space, with projected deaths taking the following equation form:

deaths = 10^(c0 + c1*a + c2*b + ... + c8*h)

using least squares for the fit.
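As a concrete sketch of this kind of log-space least-squares fit (the column names, toy data, and coefficients below are hypothetical, not drawn from the outbreak data):

```r
# Hypothetical sketch of a log-space least-squares fit of the form deaths = 10^(linear predictor).
# The toy data loosely mirrors the proposed columns (days, cluster, intensity factor).
set.seed(1)
df <- data.frame(
  days      = 1:40,
  cluster   = factor(rep(1:2, each = 20)),
  intensity = runif(40, 0.5, 1.5)
)
df$deaths <- pmax(round(10^(0.5 + 0.04 * df$days + 0.2 * df$intensity) + rnorm(40, 0, 2)), 1)

fit <- lm(log10(deaths) ~ days + cluster + intensity, data = df)
summary(fit)$coefficients

# Back on the original scale: projected deaths = 10^(fitted linear predictor)
new_pt <- data.frame(days = 50, cluster = factor(2, levels = levels(df$cluster)), intensity = 1)
10^predict(fit, newdata = new_pt)
```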

You would need to estimate the intensity factors based on prior knowledge of the environment.

I tried to put together a cluster-based analysis, but the rapid rate of inception in new environments led to poor projections when the intensity factors were mixed and the number of infections was low (see the graph to the right). Extrapolation of datasets with fewer than 100 deaths cannot be projected more than one month out, due to the rate of change of the intensity factor.

Regards, Rick


grantbrown commented 9 years ago

Hi Rick, I've just gotten a chance to take a look at your analysis. I need to preface this by mentioning that I'm not particularly familiar with deterministic modeling approaches, as I come from the statistical side of things. It also takes me a while to parse spreadsheet logic, as it's hidden in the cells - I prefer to work in something like R/Python for quick analysis sketches because the logical bits are explicitly recorded separately from the data.

With that being said, I have a couple of quick comments:

  1. You mention that you're interested in using maximum likelihood. As a statistician, specifying a likelihood (what I consider a model) is where I usually like to start. On the other hand, when you try to introduce a likelihood you'll have to explicitly confront the auto-correlation inherent in temporal data. What you're doing right now looks a lot to me like fitting a traditional regression model to log transformed data, minus the implied error term distribution. (To be fair, I'm not technically doing maximum likelihood estimation either, but for a Bayesian analysis the likelihood plays a critical part in the posterior distribution of the unknown parameters). On the other hand, there is a rich literature of deterministic SEIR and SIR models which avoid the likelihood issue entirely by fitting systems of differential equations, and trying to capture variability through simulation. I prefer the stochastic approach, as it allows me to (given a model) quantify uncertainty about all of the parameters in a formal way, as well as quantify the degree of prior information I'm including (as opposed to fixing parameters). A good example of this is the E to I and I to R transition probabilities I included in this analysis - there is pretty strong prior information included here, and if you plot the distributions you can get an idea of what the "plausible" transition probabilities are, according to the model.
  2. It's worth thinking about the implications of your underlying model. You don't appear to be constraining your predictions based on a conjecture of what the disease dynamics might be, so the model you've chosen will continue to increase to infinity. That's one of the nice things about the SEIR framework, the predictions are constrained by the population sizes.
  3. I'm not sure how I'd incorporate spatial heterogeneity into your approach. I believe there are quite a few multi-location deterministic SEIR models out there, but I'm used to defining the relationship between spatial locations as a component of the exposure probability calculation. For a full description of the assumptions that lead me to choose the hierarchical Bayesian approach that I have, this document (https://github.com/grantbrown/libspatialSEIR/blob/master/doc/models/SEIR_Algo.pdf) has a bit more detail (including a derivation of a more general version of the spatial structure employed). It has the same format as the previous model pdf, but includes more of the underlying math, and is less specifically about a disease like Ebola.

I like what you've done; it's a neat way to use simple tools to do complex things. On the other hand, the Ebola analysis I posted is really just an illustrative example of the general modeling approach I'm working on as part of my dissertation - I'm wedded to the stochastic spatial SEIR approach for now :)

devosr commented 9 years ago

You are right. The current analysis is a least squares fit of the log of the data. I haven't gotten a chance to kick off the Weibull model with an MLE approach; I may have an hour to try this tomorrow afternoon. I prefer MLE over least squares for a couple of reasons. Least squares assumes that the distribution is normal around, and at right angles to, the line being fit, and it is distorted to weigh larger values with greater importance. MLE iterates on the distribution (cloud) that best fits the variation in the data. It saves a lot of time in trying to model the variation around the mean.

Also, I'm a bit nervous about allowing the model to have too many degrees of freedom this early, as it reduces the ability to extrapolate and predict the aggregate future with much accuracy. Maybe there is a way to govern the detailed model by forcing it to add up to the more stable aggregate model.

The SEIR approach makes sense. I would focus on the application of the model to maximize the stability of future predictions. The I to R transition is a property of the underlying virus combined with a property of the population; it should not change much once quantified.

However, the E to I transition will be a function of the intensity and the medical staff training, and will vary cluster to cluster. If you constrain the model to reasonable values, this can help explain perturbations such as the reduction in slope from April 15 to May 15. You could calibrate this transition by using the unconstrained growth at the start of the outbreak (from December to April).

The classic SEIR model should be enhanced. Sigma is a function of intensity and beta. You may be able to back out sigma as a function of time and then explore other factors that might be influencing it, to explain the values.

Lastly, my simple calculation for R0 puts it around 2. I'll dig into it a bit more tomorrow. probably a mistake on my part.

Your model should be able to recognize new outbreaks earlier by scanning the data for large cluster growth rates. Resources theoretically could be directed quickly to these locations to bring the simple reproduction rate to 1 or less. A much more efficient use of resources. Like putting out a spreading fire by catching the flareups.

Unfortunately, the latest data is consistent with the simple log fit of the prior data. The WHO has yet to have an impact, and the growth goes moderately unchecked. One month should provide better insight.


grantbrown commented 9 years ago

Hmm, I wouldn't say that least squares assumes normality (it's just a fitting method - I like to think of it in the geometric sense as the projection of a data vector onto the column space of a matrix of explanatory variables). If you're using traditional linear regression along with the associated distributional assumptions, then you'd be correct.

As for maximum likelihood, ML methods are entirely dependent on the likelihood you choose. For me, that starts with a disease process model. If you're fitting curves without regard to a specific disease process, GLMMs come to mind. These would allow you to tie a linear predictor (some function of time) to the outcome via a link function (it sounds like you want to use log-scale data), while incorporating a correlation structure in time. I'm not sure what likelihood I'd choose - Weibull doesn't immediately come to mind, but it could work.

When it comes to flexibility of fit (df) vs. predictive capability, my immediate thought is to use cross validation methods. This is actually my plan moving forward - I chose degrees of freedom for the intensity process bases pretty arbitrarily, but CV could really help pin down that trade-off for this problem.
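As a generic sketch of the kind of cross-validation referred to here (a rolling-origin split choosing spline degrees of freedom by held-out error; this is not the model's actual CV pipeline, and the toy series below is made up):

```r
# Generic rolling-origin CV sketch for choosing spline degrees of freedom;
# not libspatialSEIR's fitting code, just the general trade-off being described.
library(splines)

set.seed(2)
dat    <- data.frame(t = 1:60)
dat$y  <- 0.08 * dat$t + rnorm(60, 0, 0.3)   # toy series, roughly linear on the log scale
origin <- 40                                 # fit on points 1..origin, score on the remainder

cv_error <- sapply(2:6, function(df_try) {
  fit  <- lm(y ~ ns(t, df = df_try), data = dat[1:origin, ])
  pred <- predict(fit, newdata = dat[(origin + 1):60, ])
  mean((dat$y[(origin + 1):60] - pred)^2)    # held-out mean squared error
})
names(cv_error) <- paste0("df=", 2:6)
cv_error                                     # pick the df with the smallest held-out error
```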

I do want to clarify the E to I transition - exposed, in these models, refers to individuals who are infected but not yet infectious. That transition, as with I to R, is a property of the pathogen itself. Any limitations on the spread of the disease, such as interventions, affect the exposure process, the S to E transition. The language can be a bit confusing, but it's important to note that the population mixing process (ie, exactly how people contact each other) and the infectiousness of the pathogen are actually confounded in this model - that's why the combination gets the general label "intensity" process.

I'm not sure what you refer to with Sigma and beta - there are a lot of models and even more parameterizations.

As for predictions, we can hope that the interventions will have an effect. My latest two week predictions (not yet posted) show the epidemic worsening severely in Liberia, leveling off (though not decreasing) in Sierra Leone, and worsening slightly in Guinea. Things are so variable right now that I'm no longer even trying to make predictions further than that.

devosr commented 9 years ago

I agree, you are on the right track.

Some clarifications:

Regarding the normal distribution comment, it is more accurately stated: "Under the additional assumption that the errors be normally distributed, OLS is the maximum likelihood estimator". More information can be found here: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_squares#Consistency_and_asymptotic_normality_of_.CE.B2.CC.82

Regarding the coefficient terms sigma and beta: they come from the differential equations set out here http://www.public.asu.edu/~hnesse/classes/seir.html
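For reference, a minimal deSolve sketch of the classic deterministic SEIR system in that beta/sigma/gamma parameterization; the parameter values below are arbitrary placeholders, not estimates from this thread:

```r
# Sketch of the classic deterministic SEIR system (beta, sigma, gamma parameterization),
# solved with deSolve; parameter values are arbitrary placeholders, not estimates.
library(deSolve)

seir <- function(t, state, pars) {
  with(as.list(c(state, pars)), {
    N  <- S + E + I + R
    dS <- -beta * S * I / N
    dE <-  beta * S * I / N - sigma * E
    dI <-  sigma * E - gamma * I
    dR <-  gamma * I
    list(c(dS, dE, dI, dR))
  })
}

state <- c(S = 999998, E = 1, I = 1, R = 0)
pars  <- c(beta = 0.3, sigma = 1 / 10, gamma = 1 / 8)   # placeholder rates per day
out   <- ode(y = state, times = seq(0, 300, by = 1), func = seir, parms = pars)
head(out)
# Note the constraint mentioned earlier in the thread: S + E + I + R stays fixed at N.
```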

My overall approach to fitting data for maximum predictability is as follows:
1) model the physics as fundamentally as possible;
2) limit the degrees of freedom in the model to those parameters that have been exercised well in the data set;
3) assume that the data is noisy and the distribution is unknown, then solve iteratively using MLE to characterize the cloud that represents the variation around the line;
4) you will get the best predictability when your transformed line is straight (not very scientific, just experience).

Unfortunately, I have not yet correlated the data with Weibull++ from ReliaSoft. I'll let you know how it turns out.

Best regards, and hoping for a lower R0, Rick


developmentfiend commented 9 years ago

Hello,

Love the site and the updates.

With WHO info becoming increasingly sparse, and apparently undercounting the outbreak, is there any way to backtrace based on what they said has been 'missing' from the numbers, and count from there?

Would also love to see a more extended projection of your model, as there is a dearth of prediction data for what's going to happen. While output is obviously prone to wild swings, some informed idea is better than none.

If you are unable to write a post, could you answer in a reply with the estimated #s for affected countries come 10/1, 11/1, and 12/1?

Any insight on spread would be appreciated too... other modelers have only pushed out two weeks, it would be nice to have a better grasp on potential. It unfortunately seems like Nigeria is now out of control... and the patient in Senegal also had extended and protracted contact on his journey over, from Guinea.

Things are not looking bright...

grantbrown commented 9 years ago

Hi developmentfiend,

Things are not looking bright indeed.

As for the under-counts, I am working on data models which will hopefully be able to do a better job of estimating this, but there is a lot of work left to do (statistically and from a programming perspective) before that's ready to go.

I'll see what I can do about longer term predictions, though first I need to finish sorting out some sampler efficiency changes that I've been working on. I think that the usefulness of the epidemic graphs is diminishing as the early days of the outbreak become swamped by the recent explosion of cases - I'm going to look into generating some tables. Although long term predictions haven't fared well so far, I fear we may be entering a more predictable state of exponential growth.

Keep an eye on the repository linked below for the latest (unfinished) model changes. I've split it off from this repo to keep the size down. When each iteration is ready, it will be posted to the usual place.

-outdated link removed-

devosr commented 9 years ago

Grant, a couple of developments in my analysis should affect the way that we model: 1) By January, there is a predicted reduction in the Liberian population of 10%. It will become increasingly likely that an infected person will run into another infected person rather than an uninfected person, thus reducing the rate of transmission. Unfortunately this is a tragic situation, but it should be accounted for in the model.

2) The current rate of growth in Liberia is very close to the unconstrained rate of growth in the first 3 months of the outbreak, providing a possible indication of an upper limit to the maximum growth rate.

Hope you are making progress on your model.


grantbrown commented 9 years ago

Hi Rick,

Your first point highlights one of the benefits of the SEIR framework - these sort of dynamics are "baked in", so to speak.

As for the second point, I'm not quite ready to comment. I've been thinking this morning about the estimation method I'm using for the basic reproductive number, and am considering some changes. First and most importantly, I think the current formulation may make unreasonable assumptions about the recovery period; I developed the methods while working with flu data on a weekly scale, and hadn't revisited them. Second, it seems to me that there are really two separate quantities of interest: the usually defined "basic reproductive rate", which captures the number of secondary infections caused by a single infectious individual in a fully susceptible population, and what we might call the "effective reproductive rate", which should capture the estimated number of secondary infections caused by an infectious individual in the modeled population.
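A toy illustration of that distinction, using only the textbook relationship R_eff(t) = R0 * S(t) / N; this is not the estimator used in the analysis, and the numbers are made up:

```r
# Toy illustration of "basic" vs. "effective" reproductive numbers using the
# textbook relationship R_eff(t) = R0 * S(t) / N; not the estimator in the analysis.
R0 <- 2.0
N  <- 1e6
S  <- c(1e6, 9.5e5, 8e5, 6e5, 4e5)   # hypothetical susceptible counts over time

R_eff <- R0 * S / N
round(R_eff, 2)
# [1] 2.00 1.90 1.60 1.20 0.80
```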

Model progress continues, though there's a lot to work on - both in the general framework and the specific analyses.

devosr commented 9 years ago

I spent some time working the model for R0 and would very much like your opinion on it. The red line is the result.

The spreadsheet column headings are shown below.

The method starts with a series of filters / calculations to provide stable numbers.

I calculate the number of new live cases, using exponential smoothing to interpolate back several weeks for a stable number. Interpolation is used for the following: new live infected cases in the last WW weeks, and new live infectious cases in the last 3 weeks.

R0 is calculated as A/B, where:
A = number of new infected cases in the last WW weeks
B = net live contagious cases TT weeks ago, after assuming 10% mortality (during the previous 5 weeks)

The number seems high but I can't see any optimistic assumptions

There are a couple of knobs to tweak in the weeks at the top of the columns.

The equations should be good but a second set of eyes might help.

I'm afraid the method is a bit unconventional but it's hard to draw conclusions when you only have two columns of data.

[Inline spreadsheet image. Recoverable parameters and column headings: days of exposure that a person is infected (WW) = 35 days / 5 weeks; days of exposure that a person is contagious (XX) = 35 days / 5 weeks; average time to infect (TT) = 21 days / 3 weeks. Columns: total cases; cases WW weeks ago; new infected cases in WW weeks; new cases; total deaths; current live cases; live cases XX weeks ago; new live infectious cases in XX weeks; net current contagious after 10% mortality (during TT weeks), assuming 90% can infect others; net live contagious TT weeks ago after 10% mortality (during previous 5 weeks); R0.]
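A rough R sketch of the A/B ratio described above, on hypothetical weekly cumulative counts; the spreadsheet's exponential smoothing and interpolation steps are omitted, and the helper name and mortality adjustment are illustrative assumptions rather than the spreadsheet's exact logic:

```r
# Rough sketch of the A/B ratio described above; hypothetical weekly cumulative counts,
# with the spreadsheet's smoothing/interpolation omitted. Not the spreadsheet's exact logic.
estimate_r0_ratio <- function(cum_cases, cum_deaths, WW = 5, TT = 3) {
  n <- length(cum_cases)
  A <- cum_cases[n] - cum_cases[n - WW]                 # new infected cases in the last WW weeks
  live_then <- cum_cases[n - TT] - cum_deaths[n - TT]   # live cases TT weeks ago
  B <- 0.9 * live_then                                  # net of an assumed 10% further mortality
  A / B
}

cum_cases  <- c(100, 140, 200, 280, 390, 540, 750, 1040, 1440, 2000)   # toy weekly cumulative cases
cum_deaths <- round(0.55 * cum_cases)                                  # toy cumulative deaths
estimate_r0_ratio(cum_cases, cum_deaths)
```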


devosr commented 9 years ago

https://docs.google.com/spreadsheets/d/1wXF2_Pmv3KLMDEk8EiNplE2xIncP2XZRalXOEXojUZ8/edit?usp=sharing


grantbrown commented 9 years ago

Hmm, I haven't studied R0 estimation in that context (it's a bit different for every modeling approach), but it's usually done as a function of the parameter estimates. All of my numbers are now between 0 and 3.5, so if yours deviate widely from that it's probably a function of the assumptions about case duration.

grantbrown commented 9 years ago

@developmentfiend I've been working on the first non-degenerate data model available to the software, which relaxes the assumption that the new cases are known quantities in favor of the assumption that the new case counts are sampled with (mean zero) error. This is the first step towards trying to estimate the degree to which cases are underestimated - the next step is to build a data model which assumes that only a proportion of cases are reported, though getting that working and tested may take some time.

In any case, for the test analysis I increased the prediction window for this analysis as requested to 60 days and added a table of predictions. Be very, very, extremely cautious interpreting anything about the results presented there. I have a hunch that the limited bed space and reporting ability of MSF and WHO in recent days has capped the recorded new cases, and as a result biased the trend downward.

http://grantbrown.github.io/Ebola-2014-Analysis-Archive/CurrentOverdispersed/Ebola2014.html

developmentfiend commented 9 years ago

Hi again! Just checking to see if you have any updates.

Wondering if the CDC base projections may be helpful for assuming actual versus reported case #s?

It looks like their base is 20K for 9/30... would you be able to run your model forward with those numbers? There were very few specifics given in the CDC's report, and containment was contingent on 70% hospitalization, which is, at this point, completely unfeasible.

Any possibility of modeling spread to neighboring countries, as well? Obviously that is much more complicated but I would figure as the raw % of countries infected begins to enter the single digits, contagion will occur very rapidly.

grantbrown commented 9 years ago

Hi developmentfiend,

I am indeed working on updates, though I've been developing the methodology along the way and that always has the potential to introduce bugs. I want to make sure the model continues to do the right thing before posting - shouldn't be much longer now. All updates are going to continue to be reflected at this link:

http://grantbrown.github.io/libspatialSEIR/doc/tutorials/Ebola2014/Ebola2014.html

Modeling spread to additional countries is tough, and would require introducing fairly strong assumptions about how people move between them. It's certainly something to consider for the future, but may be beyond the scope of this particular analysis simply due to time constraints.

I'll definitely take a look at the new CDC data; perhaps it will also become reflected in the Wikipedia source I'm using.

grantbrown commented 9 years ago

Sorry for the delay, I've finally updated the analysis:

http://grantbrown.github.io/libspatialSEIR/doc/tutorials/Ebola2014/Ebola2014.html

I've moved away from the temporal basis functions and instead generalized the spatial structure - this has some interesting effects on the three methods of R0 estimation we're looking at now.

devosr commented 9 years ago

Very nice work... it was worth the wait... your new methodology addresses a number of the earlier issues. Several observations:

1) The confidence bands are very tight, probably accurately restricted to the statistical confidence of the correlation. However, the confidence band assumptions might be broadened to include confidence in the SEIR factors over the whole period; they should widen when we are further from known data.
2) I found the tail-up at the end of the known data in my efforts as well. It was driven by the equal weight placed on the early E-I transition coefficient; I had to reduce the weight of the oldest data to improve the ability to extrapolate.
3) In fact, the intensity factors influencing the E-I transition vary with time and politics. The most recent data in Liberia suggests that the growth rate has been reduced by the intense effort, communication, education, and manpower.
4) I'm glad you added the charts with the infectious count. This is an excellent indicator of the success of efforts to drive down growth.

Again... nice work

grantbrown commented 9 years ago

Thanks Rick. The recent changes were mainly driven by a re-evaluation of what the temporal implications of the spline basis I was using were (aka, probably not reasonable long term), along with testing out new software features (multiple distance parameters etc.)

As for your comments:

  1. I think the narrow bands are exaggerated by the extreme change in scale. The confidence bands are to be interpreted vertically with respect to the X axis, so a steep graph appears to the human eye to have an artificially narrow confidence band width. I think that the narrowness may also be a product of the simplification of the temporal basis - in fact, I'd bet that it's oversimplified. I have some vague thoughts on how to better balance the trade-off between arbitrary (but flexible) spline bases and meaningful (but probably too simple) time-invariant location parameters, but there's a lot of work left to do.
  2. I'm not following this one - the E-I transition probability models the time that it takes for an infected person to become infectious. There's a latent period associated with Ebola (2-21 days iirc), and that's what we're estimating here. In this model it isn't allowed to vary over time, as it's a property of the pathogen.
  3. Although the E-I transitions don't behave as you describe for this particular model, I certainly hope that you're correct about the result of the increasing international response in Liberia. Detecting that kind of change is where a better temporal basis would really be beneficial.
  4. Thanks. This was a request from a while back that I just got around to implementing. I also added a table (hidden by default in a code block) which records the raw data (rightly requested by jf22 a while ago). The folks at Wikipedia seem to change their sources and formatting constantly, so having a record of what it looked like at the time will hopefully be helpful.