phillipnicol / OncoBN

Oncogenetic network estimation with Bayesian Networks
1 stars 2 forks source link

After commit 3276ad0 some thetas exactly equal to 0 or 1 #6

Open rdiaz02 opened 2 years ago

rdiaz02 commented 2 years ago

For the data set attached (d333.txt), this is what I get with commit aa039d3:

d333 <- read.table("d333.txt", header = TRUE, sep = "\t")
fitA_C <- fitCPN(d333, model = "CBN")
fitA_D <- fitCPN(d333, model = "DBN")
> fitA_C$theta
[1] 0.7924 0.8151 0.8093 0.8764 0.3083 0.9420
> fitA_D$theta
[1] 0.7924 0.8537 0.8093 0.9808 0.2862 0.8552

If I run this with commit 3276ad0, in addition to changes in the graph, some thetas are identical to 1.0 or 0.0.

d333 <- read.table("d333.txt", header = TRUE, sep = "\t")
fit3_C <- fitCPN(d333, model = "CBN")
fit3_D <- fitCPN(d333, model = "DBN")
> fit3_C$theta
     A      B      C      D      E      F 
0.7924 0.8151 1.0000 0.8764 0.0000 0.9420 
> fit3_D$theta
     A      B      C      D      E      F 
1.0000 0.8151 0.7868 0.9808 0.0000 0.8552 
> fit3_C$theta == 0
    A     B     C     D     E     F 
FALSE FALSE FALSE FALSE  TRUE FALSE 
> fit3_D$theta == 0
    A     B     C     D     E     F 
FALSE FALSE FALSE FALSE  TRUE FALSE 

Note that there are observations for the gene with theta = 0:

> colSums(d333)
   A    B    C    D    E    F 
1783 1834 1443 1972  644 1234 

I can compute with thetas of 1 and 0, but I do not understand a theta of 0: what is the model really saying? Moreover, thetas of 1 and 0 lead to predicted genotype frequencies of exactly 0. For example, for model fit3_C, above we have:

In addition, these patterns can affect genes that are not terminal leaves, but internal nodes such as E in the next example:

d222 <- read.table("d222.txt", header = TRUE, sep = "\t")
fit2_D <- fitCPN(d222, model = "DBN")
> fit2_D
$edgelist
 [1] "C"  "A"  "WT" "B"  "F"  "C"  "F"  "D"  "E"  "D"  "C"  "E"  "B"  "F" 

$score
[1] -3106

$graph
IGRAPH bd08065 DN-- 7 7 -- 
+ > attr: name (v/c)
+ > edges from bd08065 (vertex names):
[1] C ->A WT->B F ->C F ->D E ->D C ->E B ->F

$theta
     A      B      C      D      E      F 
1.0000 0.8151 1.0000 0.9808 0.0000 0.6728 

$model
[1] "DBN"

$epsilon
[1] 0.1431

And we get some genotypes with predicted probabilities exactly 0. For example: BF (because theta_C = 1), BCF (because theta_A = 1), BCFE, BCDEF because theta_C = 0, etc.

Granted, these are synthetic data and these could be corner cases, but I wonder if I am missing something. d222.txt

d333.txt

phillipnicol commented 2 years ago

Sorry for the late reply, but I had a chance to look at d333.txt. For the CBN model, we see that the estimated theta for mutation C is 1. Because A and B are the parents of C this means that the p(C | A and B) (probability of mutation C given both A and B) should be 1. Indeed, I see that sum(d333$A & d333$B) and sum(d333$A & d333$B & d333$C) are both equal to 1443.

For mutation E, the parent is C. So a theta equal to 0 should mean that p(E | C) = 0. We have sum(d333$C & d333$E) = 0. Any time that E = 1 the model considers it a "spontaneous activation" (which occurs with probability epsilon). When I change epsilon = 0 (0 tolerance for mutations not following the DAG rule) I see that E is a child of the wild type (which is expected). I am going to think a little bit more about whether or not there may be a better way to automatically choose the epsilon.

As you noticed I corrected a bug in the latest update, so that would be the reason for the observed changes. I still need to look at d222.txt, so let's keep this issue and I will get back to you shortly. Thanks!