**huilisabrina** opened this issue 4 years ago
Hi all,
Good news - the first items under Function enhancements and Parallelization on our to-do list are both checked off now! :)
I've been working on replacing all of the `df_nodes` and `df_edges` in the serial version of the network-updating functions with our GF object. This step is now complete, and the new code is in `network_update_GF.py`.
I realized along the way that we can easily "hack" the issue of keeping track of time steps by simply adding another property to the vertices attribute of the graph. Please see the implementation for more details. It took me some time to learn the `pyspark.sql.DataFrame` API, haha. Looking forward to testing this more thoroughly!
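To illustrate the idea, here is a minimal sketch in plain Python (not the script's actual schema - the `elapsed_days` field name is an assumption): each vertex carries an extra counter that is incremented once per time step. With GraphFrames, the same update would be a `withColumn` on the vertices DataFrame.

```python
# Illustrative sketch: vertices as plain dicts; in the GF version these
# would be rows of g.vertices, and the update a withColumn expression.
vertices = [
    {"id": "a", "state": "E", "elapsed_days": 0},
    {"id": "b", "state": "S", "elapsed_days": 0},
]

def advance_one_step(vertices):
    """Increment the day counter for every E or I vertex; leave others as-is."""
    return [
        {**v, "elapsed_days": v["elapsed_days"] + 1}
        if v["state"] in ("E", "I") else v
        for v in vertices
    ]

vertices = advance_one_step(vertices)
```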
Thanks, Hui
Remaining items are incorporated in the new list of action points, see #3
Hi @smwu , @intekhab8 and @beancamille ,
I've just uploaded a script called `network_update_serial.py`. Here are some detailed explanations of what was done and what we can do later. The script contains a few components:
Functions that implement the `H --> R, D`, `I --> R, H, D`, `E --> I`, and `I --> E --> S` steps. These functions were adapted from the epidemic-network SIR model that Stephanie shared earlier. The main modifications were changing the binomial probability of state transition to a multinomial one for H/I nodes, and adding the E state to account for the latent and infectious periods.

Note: for `E --> I` and `I --> E --> S` to work, I think we need to somehow record the number of periods elapsed since a person becomes E or I. In other words, we want an E node to start infecting people after `t_latent` days have elapsed, and an I node to stop infecting people after `t_infectious` days. To achieve this, I'm thinking we can define a `Node` class and add an attribute, e.g. `Node.elapsed_days`, to keep track of the number of days since a node became E or I: initialize `node.elapsed_days` to zero and add one to it after each time step, except when an I node transitions to R, H, or D. I tried doing this but was not very successful, so the current code effectively assumes that `t_latent` is 1 (i.e. an E node becomes infectious in the very next period) and `t_infectious` is infinite (i.e. an I node can always infect people as long as it does not transition to another state).

A simulation wrapper that calls these functions to update the graph. Its inputs are the parameters of the SEIRHD model, and its output is a table that traces the number of nodes in each category over time (i.e. rows are indexed by time period). It also outputs the duration, which is at most the maximum number of time steps we set, but may be less if the number of infectious nodes has dropped to zero. This function was also adapted from Stephanie's notebook.
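The multinomial transition and the `t_latent`/`t_infectious` gating described above could look roughly like this (a sketch in plain Python; the probabilities, constants, and function names are illustrative, not the script's actual API):

```python
import random

T_LATENT = 3        # days before an E node starts infecting (illustrative)
T_INFECTIOUS = 10   # days after which an I node stops infecting (illustrative)

def transition_I(p_R=0.05, p_H=0.02, p_D=0.01, rng=random):
    """Multinomial draw for an I node: stay I, or move to R, H, or D."""
    p_stay = 1.0 - (p_R + p_H + p_D)
    return rng.choices(["I", "R", "H", "D"],
                       weights=[p_stay, p_R, p_H, p_D])[0]

def can_infect(state, elapsed_days):
    """Mirror the rule above: an E node infects only once t_latent days
    have elapsed; an I node stops infecting after t_infectious days."""
    if state == "E":
        return elapsed_days >= T_LATENT
    if state == "I":
        return elapsed_days <= T_INFECTIOUS
    return False
```

With `elapsed_days` tracked per node, both thresholds become simple comparisons instead of the hard-coded `t_latent = 1` / infinite `t_infectious` behavior.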
A toy example (two dataframes, one for nodes and one for edges) and a simulation run with a few parameter values. This step will eventually be replaced by the dataset Inte provided. For scalability testing, we can tweak the toy example to keep it simple.

Note: it would be great to work with the toy example (for now) and add a `cluster` column to the nodes, to mimic the non-overlapping samples in the dataset we'll be using. We'd need to modify the function so that it can take advantage of these disjoint sets of nodes and run the updates in parallel (see below).

Here are some of the action points:
Function enhancements:
- [x] Enable `t_latent` and `t_infectious`. One way is to define a `node` class and add an attribute to it that records the number of days elapsed since the node became type E or I.
- [x] Enable a single I node to infect multiple neighboring S nodes in one time step. The current code only allows at most one infection per person per period.
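The multiple-infections enhancement amounts to one independent Bernoulli draw per susceptible neighbor, so several can be exposed in the same step (illustrative names throughout):

```python
import random

def expose_neighbors(s_neighbors, beta, rng):
    """Each susceptible neighbor of an I node is exposed independently
    with per-edge transmission probability beta."""
    return [n for n in s_neighbors if rng.random() < beta]

rng = random.Random(42)
# any subset of the four S neighbors can turn E in this single time step
newly_exposed = expose_neighbors(["n1", "n2", "n3", "n4"], beta=0.5, rng=rng)
```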
Parallelization:
- [x] Integrate the updating functions with GraphFrames (I'll write another issue to explain how to set this up; this set of code does not involve GF yet).
- [ ] Adapt the functions so that they can exploit the `cluster` (non-overlapping) structure of the network data, running the updates in parallel for different clusters. Test this by adding a `cluster` column to the nodes dataframe.

Nice to have:
- [ ] Visualize the progression of the number of nodes in each category over time. This would be handy for testing hypotheses, when we need to see the impact of changing parameter values, etc.
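The cluster-based parallelization item could be sketched like this (plain Python with a toy update rule, purely illustrative; with Spark, the disjoint clusters would map naturally onto partitions):

```python
from collections import defaultdict

def update_cluster(nodes):
    """Toy within-cluster update rule (illustrative): every E becomes I."""
    return [{**n, "state": "I"} if n["state"] == "E" else n for n in nodes]

def update_by_cluster(nodes):
    """Group nodes by their disjoint cluster label and update each group
    independently -- the groups share no edges, so the per-cluster calls
    could run in parallel (done serially here for simplicity)."""
    groups = defaultdict(list)
    for n in nodes:
        groups[n["cluster"]].append(n)
    return [n for cl in sorted(groups) for n in update_cluster(groups[cl])]

nodes = [
    {"id": 1, "cluster": 0, "state": "E"},
    {"id": 2, "cluster": 1, "state": "S"},
    {"id": 3, "cluster": 1, "state": "E"},
]
updated = update_by_cluster(nodes)
```

Because clusters share no edges, the per-cluster results are identical to a serial whole-graph update, which makes the parallel version easy to test against the serial one.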