DAGs that are not transitively reduced

phillipnicol / OncoBN

Oncogenetic network estimation with Bayesian Networks


DAGs that are not transitively reduced #5

Open rdiaz02 opened 2 years ago

rdiaz02 commented 2 years ago

With the attached data I ran OncoBN as follows:

library(OncoBN)
d1 <- read.table("d1.txt", header = TRUE, sep = "\t")
fit <- fitCPN(d1, model = "CBN", algorithm = "DP", k = 3)

and was surprised to find a DAG that is not transitively reduced. This is easy to see here:

library(igraph)
plot(fit$graph, layout = layout_as_tree)

(attached too)

Node B depends on nodes D and C, but C itself depends on D.
Node A depends on both C and B, but B itself depends on C.

I would have expected not to see:

an edge from D to B (since B depends on C which depends on D)
an edge from C to A (since A depends on B which depends on C)
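The expectation above can be checked mechanically. The following is a minimal sketch in base R, assuming the edge list described in this post (typed in by hand here, not read from fit$graph); the names nodes, edges, adj, reach and redundant are introduced only for illustration. An edge u -> v is flagged as redundant when some other child of u already reaches v.

```r
# Edges as described in the post (parent -> child). This list is typed in
# by hand for illustration; it is not extracted from fit$graph.
nodes <- c("A", "B", "C", "D")
edges <- rbind(c("D", "C"), c("C", "B"), c("D", "B"),
               c("B", "A"), c("C", "A"))

# Logical adjacency matrix: adj[u, v] is TRUE when there is an edge u -> v.
adj <- matrix(FALSE, length(nodes), length(nodes),
              dimnames = list(nodes, nodes))
adj[edges] <- TRUE  # two-column character matrix indexes by dimnames

# Reachability through paths of length >= 1 (Warshall's algorithm).
reach <- adj
for (k in nodes) {
  reach <- reach | outer(reach[, k], reach[k, ], "&")
}

# An edge u -> v is redundant if u has a child w from which v is reachable.
redundant <- adj & (((adj * 1) %*% (reach * 1)) > 0)

idx <- which(redundant, arr.ind = TRUE)
redundant_edges <- paste(nodes[idx[, "row"]], nodes[idx[, "col"]], sep = "->")
print(redundant_edges)
```

For this hand-typed edge list the flagged edges are D->B and C->A, the two edges questioned above. To run the same check on a fitted model, the adjacency matrix would have to be obtained from fit$graph first, for example with igraph's as_adjacency_matrix(); that step is not shown here.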

I am not sure how to interpret the output. And I think CBN itself (Gerstung et al., 2009, for example) does not return DAGs that are not transitively reduced.

What am I missing?

Attachment: d1.txt

phillipnicol commented 2 years ago

It seems like both graphs (the given one and the one without the edges from D to B and from C to A) have the same likelihood. Both graphs also assign the same probability to each genotype, since C itself requires D and B itself requires C.

It seems like the algorithm picks the more complicated graph when there is a tie in the likelihood. I agree this makes the result harder to interpret. I think this should be a relatively easy fix, and I will update when I can get on a computer tomorrow.

phillipnicol commented 2 years ago

Actually the two models have slightly different likelihoods... Under the graph printed above we have

P(B = 0 | C = 1, D = 0) = 1 - epsilon

because the parents C and D are not both equal to 1. If we remove the edge from D to B, this probability becomes

P(B = 0 | C = 1, D = 0) = 1 - theta_b

because now C is the only parent of B.

In the data, there are at least 10 observations with (B = 0, C = 1, D = 0) (see d1[59,]). Because epsilon is small, 1 - epsilon will tend to be larger than 1 - theta_b, which favors the model that includes an edge from D to B.
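To make the comparison concrete, here is a sketch of the log-likelihood contribution of those rows alone. The symbols n (the number of observations with B = 0, C = 1, D = 0) and Delta are introduced here for illustration; in a full refit the estimates of theta_b and epsilon, and the contributions of other rows, would change as well, so this isolates only one term.

```latex
\Delta
  = n \log(1 - \epsilon) - n \log(1 - \theta_b)
  = n \log\frac{1 - \epsilon}{1 - \theta_b}
  > 0 \quad \text{whenever } \epsilon < \theta_b .
```

So each such observation adds a positive amount to the log-likelihood of the graph that keeps the edge from D to B, and the amount grows as theta_b grows relative to epsilon.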

However, when the edge from D to B is removed, the log-likelihood decreases by only 2. It might make sense to include an AIC/BIC-type penalty on the number of edges to avoid situations like this.
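As a rough illustration of how such a penalty could act, the sketch below compares two nested graphs on the log-likelihood scale. It is not part of OncoBN: penalized_gain is a hypothetical helper, counting each edge as one parameter is an assumption of this sketch, and n_obs = 100 is an illustrative sample size rather than the size of d1. Only the log-likelihood difference of 2 comes from the discussion above.

```r
# Hypothetical helper, not an OncoBN function.
# delta_loglik: log-likelihood of the larger graph minus that of the smaller.
# extra_edges:  number of additional edges in the larger graph.
# n_obs:        number of observations (only used for the BIC-type penalty).
# Assumption of this sketch: each edge counts as one parameter, so on the
# log-likelihood scale the per-edge penalty is 1 (AIC) or log(n_obs) / 2 (BIC).
penalized_gain <- function(delta_loglik, extra_edges, n_obs,
                           penalty = c("AIC", "BIC")) {
  penalty <- match.arg(penalty)
  per_edge <- if (penalty == "AIC") 1 else log(n_obs) / 2
  delta_loglik - per_edge * extra_edges
}

# A positive value favors keeping the extra edge, a negative one dropping it.
gain_aic <- penalized_gain(2, extra_edges = 1, n_obs = 100, penalty = "AIC")
gain_bic <- penalized_gain(2, extra_edges = 1, n_obs = 100, penalty = "BIC")
print(c(AIC = gain_aic, BIC = gain_bic))
```

Under these assumptions, a gain of 2 log-likelihood units survives the AIC-type penalty but not the BIC-type one once log(n_obs) / 2 exceeds 2, which is why the choice of penalty matters for borderline edges like D to B.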

rdiaz02 commented 2 years ago

Thanks for the detailed analysis! If I understand correctly, there are two issues: a) model choice itself, which might resolve things in favor of the smaller, transitively reduced models when using AIC/BIC; b) interpretation. Regarding b), the difference with respect to CBN (Gerstung et al., 2009, for instance) arises because of epsilon in OncoBN. This makes a lot of sense now.