
nutterb / dbn

Simulation of Discrete-Time Dynamic Bayesian Networks


Initial Structure #2

Closed nutterb closed 7 years ago

nutterb commented 8 years ago

I suspect the initial setup will be similar to HydeNet.

We will need an object on which to act, so let's call that dbn. It will need the following attributes:

[x] network formula (the call?)
[x] adjacency matrix
[x] node attributes table (for this, I'm considering leveraging tibble (tbl_df) objects, which allow lists and models to be stored in the table)

The node attributes need to include

[x] node name
[x] parents
[x] is_decision
[x] is_utility
[x] is_temporal (FALSE means it's a global variable, such as sex)
[x] model
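
As a rough illustration of the structure listed above, the object could hold the node names, an adjacency matrix derived from the parent lists, and one attribute record per node. This is only a sketch under assumptions: it is written in Python rather than R to stay self-contained, and the names `NodeAttributes` and `build_network`, as well as the example nodes, are hypothetical and not taken from the package.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class NodeAttributes:
    """One row of the node attributes table (fields follow the list above)."""
    name: str
    parents: List[str] = field(default_factory=list)
    is_decision: bool = False
    is_utility: bool = False
    is_temporal: bool = True      # False means a global variable, such as sex
    model: Optional[Any] = None   # a fitted model object would be stored here


def build_network(parent_map: Dict[str, List[str]]) -> dict:
    """Build a hypothetical network object from a {node: [parents]} mapping."""
    # Collect every node mentioned, either as a child or as a parent.
    names = list(parent_map)
    for parents in parent_map.values():
        for p in parents:
            if p not in names:
                names.append(p)
    index = {n: i for i, n in enumerate(names)}
    # adjacency[i][j] == 1 means an edge from node i (parent) to node j (child).
    adjacency = [[0] * len(names) for _ in names]
    for child, parents in parent_map.items():
        for p in parents:
            adjacency[index[p]][index[child]] = 1
    nodes = [NodeAttributes(name=n, parents=list(parent_map.get(n, [])))
             for n in names]
    return {"names": names, "adjacency": adjacency, "nodes": nodes}


# Example: an outcome node with two parents (node names are made up).
net = build_network({"outcome": ["sex", "age"]})
```

Keeping the parents on each node record and deriving the adjacency matrix from them means the two representations cannot drift apart.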

Rather than gathering all of the information to build a model, let's instead have users insert the model directly. If we aren't pushing the simulation out to JAGS, we don't need as diverse an object to accommodate direct JAGS coding.

The only immediate downside I see to this is that it won't let us simulate from models for which we don't have an object, such as a published model. I'm not quite sure how to handle that yet.

Need a formula and a list method.

[x] formula method
[x] list method

We will have to restrict the use of xtabs to a single variable, unless someone can give me a reasonable interpretation of what a multivariable xtabs should look like in a network.

[x] xtabs with more than one variable prohibited.

jarrod-dalton commented 8 years ago

I don't know - this is complicated. I'm not suggesting an alternative here, but just jotting down some thoughts.

A really simple-minded object hierarchy would be something like

tbn
 |--> temporalNode
 |--> staticNode

With some other detail & sub-classes, obviously.

I guess the point is that the object structure might look different enough for temporalNodes (compared to staticNodes) that we might want to make them explicit classes.

For temporalNodes, we would need to specify node dependencies (names of nodes on which the node in question depends, not worrying about the nature of the relationships).

There are some important details here:

Static (or global, if you prefer) nodes cannot depend on temporal nodes, only on other static nodes.
Temporal nodes can depend on static nodes.
When a temporal node depends on another temporal node, the user needs to make explicit the length of the dependency in time steps, with the default being 1 time step (i.e., a Markov model).

For example, we might want to model cholesterol(t) as a function of sex, age((t-5):(t-1)) and cortisol((t-2):(t-1)). We'd need to specify:
1) that cholesterol, age and cortisol are temporalNodes
2) that the sex node is a staticNode
3) that the cholesterol node depends on sex, age and cortisol
4) that the dependency of cholesterol on age involves the previous 5 values of age (t-1, t-2, ..., t-5) and that the dependency of cholesterol on cortisol involves the previous 2 values of cortisol (t-1 and t-2)
5) equations/distributions/models for all nodes, being careful that such equations/distributions/models are only allowed to reference what has been specified in the dependencies

Of course, an automatic means of populating the dependencies based on what's embedded in the model objects would be very helpful.
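
To make the dependency rules above concrete, here is a minimal sketch of how a specification for the cholesterol example might be written down and checked. It is an assumption-laden illustration in Python, not package code: the dictionary layout, the convention that a lag count of 0 marks a static parent, and the helper names `validate` and `lag_offsets` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical specification: child -> {parent: number of lagged values used}.
# A lag count of 0 marks a dependency on a static node (no time index).
dependencies: Dict[str, Dict[str, int]] = {
    "cholesterol": {"sex": 0, "age": 5, "cortisol": 2},
}
temporal_nodes: Set[str] = {"cholesterol", "age", "cortisol"}
static_nodes: Set[str] = {"sex"}


def validate(deps: Dict[str, Dict[str, int]],
             temporal: Set[str],
             static: Set[str]) -> List[str]:
    """Return a list of rule violations; an empty list means the spec is valid."""
    problems = []
    for child, parents in deps.items():
        for parent, n_lags in parents.items():
            # Static nodes may only depend on other static nodes.
            if child in static and parent in temporal:
                problems.append(
                    f"static node {child} depends on temporal node {parent}")
            # A temporal parent must be referenced through at least one lag.
            if parent in temporal and n_lags < 1:
                problems.append(
                    f"{child} -> {parent}: temporal dependency needs >= 1 lag")
    return problems


def lag_offsets(n_lags: int = 1) -> List[int]:
    """Offsets relative to t for the previous n_lags values (default 1, Markov)."""
    return [-k for k in range(1, n_lags + 1)]


problems = validate(dependencies, temporal_nodes, static_nodes)
age_offsets = lag_offsets(dependencies["cholesterol"]["age"])
```

Here `age_offsets` lists the time offsets t-1 through t-5 that the cholesterol model would be allowed to reference for age.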

jarrod-dalton commented 7 years ago

Just beginning to play around with this. Sorry about the delay. What follows are some rather free-form notes as I work my way through the code. I'll leave it up to you to parse this into notable issues (if any).

I noticed that dag_structure() is unexported. Did you intend that?
I'm slightly concerned about performance. We should consider the overhead involved with exporting dbn objects to a cluster and running simulations in parallel. It would be nice to eventually just have a parallel = TRUE optional parameter in the main simulation function.
On a related note that I think I may have mentioned before (or at least thought about), we should consider abandoning the use of Bayesian simulation, at least up front. There are a few implications of this:
1. It's (obviously) not really all that Bayesian. All we'd be doing is simulating forward in time, without making any inferences "upstream" from a child node to a parent node. We can't get away with calling this a dynamic Bayesian network if we don't eventually make it Bayesian (i.e., allow inference back in time for unobserved nodes). But probably, the majority of use cases will involve just predicting forward in time. Under this restriction, computations can be sped up significantly, using vectorized calls to rnorm, rbern, etc. without having to do it in parallel.
2. It would make the initial implementation much easier and allow us to focus on getting the model structure right.
3. We need to think about how the package might smartly choose among straight-up stochastic simulation vs. some Bayesian algorithm (MCMC, no-U-turn sampler, etc.). Maybe JAGS and Stan already do this smartly, which would imply that we should ignore all my banter here and just implement it in one of those frameworks. (There are downsides to that as well.)
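
The forward-only idea in point 1 could look roughly like the sketch below: static values are drawn once, and each time step is drawn for all subjects at once from the previous step, with nothing ever revised. This is an illustration in Python under made-up assumptions; the function name `simulate_forward`, the transition model, and its coefficients are invented for the example. In R, each per-step list comprehension would presumably be a single vectorized call such as rnorm.

```python
import random


def simulate_forward(n_subjects: int, n_steps: int, seed: int = 0):
    """Forward-only simulation: values are drawn given parents, never revised."""
    rng = random.Random(seed)
    # Static node: drawn once per subject (0/1 coding is arbitrary).
    sex = [1 if rng.random() < 0.5 else 0 for _ in range(n_subjects)]
    # Temporal node: one list of per-subject values for each time step.
    cortisol = [[rng.gauss(10.0, 2.0) for _ in range(n_subjects)]]
    for t in range(1, n_steps):
        previous = cortisol[t - 1]
        # Made-up transition model: pull toward 10, shift by sex, add noise.
        current = [
            10.0 + 0.8 * (previous[i] - 10.0) + 0.5 * sex[i]
            + rng.gauss(0.0, 1.0)
            for i in range(n_subjects)
        ]
        cortisol.append(current)
    return sex, cortisol


sex, cortisol = simulate_forward(n_subjects=4, n_steps=6, seed=1)
```

Because each step only reads values that already exist, no check for backward inference is needed and the same seed reproduces the same trajectories.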

nutterb commented 7 years ago

I did intend for dag_structure to be unexported. I am just storing it in its own file, instead of defining it in dbn. I don't normally export utility functions unless there is a belief that the user may find it useful in other settings. I can't think of a use case for dag_structure elsewhere right now.
Parallelization is fairly simple to accomplish, and will be considered when I actually get to simulation. I'll prepare that argument in whatever function does the simulations.
I had been operating under the assumption that we were abandoning the truly Bayesian approach. This had been one of the reasons I was exploring package names around "Dynamic Systems" instead of "Dynamic Bayesian." If you want to grow this into a Bayesian package eventually, I'll need to rethink the strategy again. It won't change a lot, but if this is eventually going Bayesian, doing a purely Bayesian implementation makes a lot more sense. JAGS and Stan are both inherently multi-threaded (eliminating the need for us to implement parallel methods) and are indifferent to whether we are searching forward or backward in time (eliminating the need for us to implement checks that we only go forward). If this is where we want to go, I'll need to think hard about whether dbn should be an extension of HydeNet, an alternative to HydeNet, or a HydeNet 2.0. The answer to that question isn't immediately obvious to me.

jarrod-dalton commented 7 years ago

How about a cop-out? "dbn" could just as well stand for "dynamic belief networks", which doesn't explicitly address whether or not they are truly Bayesian inference machines. I do like the idea of sticking with a strictly forecasting (now into the future, no past) package, for reasons stated above. It also allows for a much richer class of models that could feasibly be incorporated into the system (e.g., any function that takes parents as inputs and outputs some vector of predicted values for the child node).
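
A minimal sketch of that last idea, where a node's model is just a function from parent values to a vector of predictions for the child, might look like the following. It is written in Python for illustration only; the node names, the toy models, and the `evaluate` helper are hypothetical, and the standard-library `graphlib.TopologicalSorter` is used to make sure parents are evaluated before their children.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical network: node -> list of parents.
parents = {"sex": [], "age": [], "risk": ["sex", "age"]}

# Hypothetical node models: each takes a dict of parent values and returns a
# list of predicted values for the child, one per subject.
models = {
    "sex": lambda p: [0, 1, 1],
    "age": lambda p: [40.0, 55.0, 70.0],
    "risk": lambda p: [0.01 * a + 0.1 * s
                       for s, a in zip(p["sex"], p["age"])],
}


def evaluate(parents, models):
    """Evaluate every node after all of its parents have been evaluated."""
    values = {}
    for node in TopologicalSorter(parents).static_order():
        parent_values = {name: values[name] for name in parents[node]}
        values[node] = models[node](parent_values)
    return values


values = evaluate(parents, models)
```

Any fitted model with a predict-style interface could be wrapped in such a function, which is what makes the forward-only design flexible about model classes.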