**huilisabrina** opened this issue 4 years ago
Hi all,
Good news - the first items under Function enhancements and Parallelization on our to-do list are both checked off now! :)
I've been working on replacing all of the `df_nodes` and `df_edges` in the serial version of the network-updating functions with our GF object. This step is now complete, and the new code is in `network_update_GF.py`.
I realized along the way that we can easily "hack" the issue of keeping track of time steps by simply adding another property to the vertices attribute of the graph. Please see the implementation for more details. It took me some time to learn the `pyspark.sql.DataFrame` API, haha. Looking forward to testing this more thoroughly!
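To illustrate the idea, here is a minimal sketch in plain Python (not the script's actual schema - the `elapsed_days` field name is an assumption): each vertex carries an extra counter that is incremented once per time step. With GraphFrames, the same update would be a `withColumn` on the vertices DataFrame.

```python
# Illustrative sketch: vertices as plain dicts; in the GF version these
# would be rows of g.vertices, and the update a withColumn expression.
vertices = [
    {"id": "a", "state": "E", "elapsed_days": 0},
    {"id": "b", "state": "S", "elapsed_days": 0},
]

def advance_one_step(vertices):
    """Increment the day counter for every E or I vertex; leave others as-is."""
    return [
        {**v, "elapsed_days": v["elapsed_days"] + 1}
        if v["state"] in ("E", "I") else v
        for v in vertices
    ]

vertices = advance_one_step(vertices)
```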
Thanks, Hui
Remaining items are incorporated in the new list of action points, see #3
Hi @smwu , @intekhab8 and @beancamille ,
I've just uploaded a script called `network_update_serial.py`. Here are some detailed explanations of what was done and what we can do later. The script contains a few components:
Functions that implement the `H --> R, D`, `I --> R, H, D`, `E --> I`, and `I --> E --> S` steps. These functions were adapted from the epidemic-network SIR model that Stephanie shared earlier. The main modifications were changing the binomial probability of state transition to a multinomial one for H/I nodes, and adding the E state to account for the latent and infectious periods.

Note: for `E --> I` and `I --> E --> S` to work, I think we need to somehow record the number of periods elapsed since a person becomes E or I. In other words, we want an E node to start infecting people after `t_latent` days have elapsed, and an I node to stop infecting people after `t_infectious` days. To achieve this, I'm thinking we can define a `Node` class and add an attribute, e.g. `Node.elapsed_days`, to keep track of the number of days since a node became E or I: initialize `node.elapsed_days` to zero and add one to it after each time step, except when an I node transitions to R, H, or D. I tried doing this but was not very successful, so the current code effectively assumes that `t_latent` is 1 (i.e. an E node becomes infectious in the very next period) and `t_infectious` is infinite (i.e. an I node can always infect people as long as it does not transition to another state).

A simulation wrapper that calls these functions to update the graph. Its inputs are the parameters of the SEIRHD model, and its output is a table that traces the number of nodes in each category over time (i.e. rows are indexed by time period). It also outputs the duration, which is at most the maximum number of time steps we set, but may be less if the number of infectious nodes has dropped to zero. This function was also adapted from Stephanie's notebook.
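The multinomial transition and the `t_latent`/`t_infectious` gating described above could look roughly like this (a sketch in plain Python; the probabilities, constants, and function names are illustrative, not the script's actual API):

```python
import random

T_LATENT = 3        # days before an E node starts infecting (illustrative)
T_INFECTIOUS = 10   # days after which an I node stops infecting (illustrative)

def transition_I(p_R=0.05, p_H=0.02, p_D=0.01, rng=random):
    """Multinomial draw for an I node: stay I, or move to R, H, or D."""
    p_stay = 1.0 - (p_R + p_H + p_D)
    return rng.choices(["I", "R", "H", "D"],
                       weights=[p_stay, p_R, p_H, p_D])[0]

def can_infect(state, elapsed_days):
    """Mirror the rule above: an E node infects only once t_latent days
    have elapsed; an I node stops infecting after t_infectious days."""
    if state == "E":
        return elapsed_days >= T_LATENT
    if state == "I":
        return elapsed_days <= T_INFECTIOUS
    return False
```

With `elapsed_days` tracked per node, both thresholds become simple comparisons instead of the hard-coded `t_latent = 1` / infinite `t_infectious` behavior.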
A toy example (two dataframes, one for nodes and one for edges) and a simulation run with a few parameter values. This step will eventually be replaced by the dataset Inte provided. For scalability testing, we can tweak the toy example to keep it simple.

Note: it would be great to work with the toy example (for now) and add a `cluster` column to the nodes, to mimic the non-overlapping samples in the dataset we'll be using. We'd need to modify the function so that it can take advantage of these disjoint sets of nodes and run the updates in parallel (see below).

Here are some of the action points:
Function enhancements:
- [x] Enable `t_latent` and `t_infectious`. One way is to define a `node` class and add an attribute to it that records the number of days elapsed since the node became type E or I.
- [x] Enable a single I node to infect multiple neighboring S nodes in one time step. The current code only allows at most one infection per person per period.
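The multiple-infections enhancement amounts to one independent Bernoulli draw per susceptible neighbor, so several can be exposed in the same step (illustrative names throughout):

```python
import random

def expose_neighbors(s_neighbors, beta, rng):
    """Each susceptible neighbor of an I node is exposed independently
    with per-edge transmission probability beta."""
    return [n for n in s_neighbors if rng.random() < beta]

rng = random.Random(42)
# any subset of the four S neighbors can turn E in this single time step
newly_exposed = expose_neighbors(["n1", "n2", "n3", "n4"], beta=0.5, rng=rng)
```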
Parallelization:
- [x] Integrate the updating functions with GraphFrames (I'll write another issue to explain how to set this up; this set of code does not involve GF yet).
- [ ] Adapt the functions so that they can exploit the `cluster` (non-overlapping) structure of the network data, running the updates in parallel for different clusters. Test this by adding a `cluster` column to the nodes dataframe.

Nice to have:
- [ ] Visualize the progression of the number of nodes in each category over time. This would be handy for testing hypotheses, when we need to see the impact of changing parameter values, etc.
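The cluster-based parallelization item could be sketched like this (plain Python with a toy update rule, purely illustrative; with Spark, the disjoint clusters would map naturally onto partitions):

```python
from collections import defaultdict

def update_cluster(nodes):
    """Toy within-cluster update rule (illustrative): every E becomes I."""
    return [{**n, "state": "I"} if n["state"] == "E" else n for n in nodes]

def update_by_cluster(nodes):
    """Group nodes by their disjoint cluster label and update each group
    independently -- the groups share no edges, so the per-cluster calls
    could run in parallel (done serially here for simplicity)."""
    groups = defaultdict(list)
    for n in nodes:
        groups[n["cluster"]].append(n)
    return [n for cl in sorted(groups) for n in update_cluster(groups[cl])]

nodes = [
    {"id": 1, "cluster": 0, "state": "E"},
    {"id": 2, "cluster": 1, "state": "S"},
    {"id": 3, "cluster": 1, "state": "E"},
]
updated = update_by_cluster(nodes)
```

Because clusters share no edges, the per-cluster results are identical to a serial whole-graph update, which makes the parallel version easy to test against the serial one.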