sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Output Causal Matrix Discrete Values vs Continuous Values #21

Closed · Cby19961020 closed this issue 3 years ago

Cby19961020 commented 3 years ago

Hi, I have a question about the output causal matrix of learn.dynamic.network. After performing structural learning using SEM, I receive a causal matrix like this:

mydata <- BNDataset("test_fault_free.data", "test_1.header", starts.from = 0)
net <- learn.dynamic.network(mydata, algo = "sem", scoring.func = "BIC", num.time.steps = 1)

net <- learn.network(mydata, algo = "sem", scoring.func = "BIC")

d1 <- dag(net)
print(d1)
plot(net)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    1    0    0    0    0    0    0
[2,]    0    0    0    0    0    0    0    0
[3,]    0    0    0    0    0    0    1    0
[4,]    0    1    1    0    0    0    1    0
[5,]    0    0    0    1    0    0    0    1
[6,]    0    0    0    0    0    0    0    0
[7,]    0    0    0    0    0    0    0    0
[8,]    0    1    1    0    0    0    1    0

and I am trying to plot the ROC curve; however, because the entries are all discrete (1 or 0), I cannot do this. For other methods out there, such as a package called SLARAC, the causal matrix looks something like this (score matrix for the adjacency matrix as inferred by SLARAC, post-standardised):

[[-0.25 -0.26  0.44 -0.24 -0.2  -0.22 -0.11 -0.25]
 [-0.25 -0.22  0.57 -0.13 -0.23 -0.22  0.07 -0.2 ]
 [-0.27 -0.27  1.37 -0.24 -0.27 -0.26 -0.27 -0.27]
 [-0.26 -0.23  0.42 -0.01 -0.25 -0.24 -0.12 -0.24]
 [-0.26 -0.26 -0.22 -0.26 -0.22 -0.25 -0.26 -0.27]
 [-0.26 -0.26 -0.15 -0.26 -0.26 -0.24 -0.26 -0.27]
 [-0.26 -0.26  0.32 -0.16 -0.26 -0.26  1.43 -0.24]
 [-0.24  0.    7.5  -0.22 -0.2  -0.15 -0.23  0.28]]

This matrix can be used to plot the ROC curve. I am wondering if I can obtain a continuous-valued causal matrix from bnstruct too. Please kindly let me know. Thank you and stay safe!

albertofranzin commented 3 years ago

Hi,

using a standard learning procedure, you can only get a binary matrix (arc present/absent). If you use bootstrap, you can instead get a matrix called wpdag (weighted partially directed acyclic graph), where each entry a_ij is a continuous value representing the confidence inferred for the corresponding edge from the bootstrap samples: higher values are assigned to edges that are more likely to appear in the real original network.
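For example, the bootstrap counts in the wpdag can be normalised and fed to a standard ROC routine. A minimal sketch, assuming the asia dataset bundled with bnstruct, the external pROC package, and a hypothetical ground-truth adjacency matrix true.dag that you would supply yourself:

library(bnstruct)
library(pROC)  # assumed external package, used here only for the ROC curve

# learn with bootstrap to obtain a weighted PDAG
d <- asia()
d <- bootstrap(d, 100)
n <- learn.network(d, bootstrap = TRUE)

# normalise the bootstrap counts (out of 100 samples) to [0, 1] confidences
conf <- wpdag(n) / 100

# true.dag: hypothetical 0/1 ground-truth adjacency matrix of the same
# dimensions as conf, to be replaced with your own reference network
off.diag <- row(conf) != col(conf)
roc.curve <- roc(response = true.dag[off.diag], predictor = conf[off.diag])
plot(roc.curve)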

I'm not familiar with SLARAC, but from a very quick glance it seems to me that it computes the same kind of information (with a different algorithm, of course).

Hope this helps!

Cheers

Cby19961020 commented 3 years ago

Hi friend,

Thank you for your prompt reply. Please excuse my naivety, but the way I see it, bootstrap (achieved using learn.network()) is a different structural learning algorithm, not equivalent to learning a dynamic Bayesian network using SEM, and we can only receive a binary matrix when using SEM (achieved using learn.dynamic.network()). Could you please verify this for me? Also, since I will be working with sparsely sampled data, I am wondering whether bootstrap can handle missing values. Please kindly let me know. Thanks again! Your expertise is much appreciated!

Regards, Bo

albertofranzin commented 3 years ago

Hi Bo,

yes, let me clarify. First, learn.dynamic.network() is just an interface to learn.network(), so they work the same way.

You can impute missing values in the bootstrap samples with dataset <- bootstrap(dataset, 100, imputation = TRUE) if dataset contains missing data. What happens is that kNN imputation is applied separately to each sample.
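For instance, a quick sketch of how to retrieve the individual samples afterwards (assuming the child dataset bundled with the package, and the boot()/num.boots() accessors with the use.imputed.data flag; double-check the signatures against the manual):

d <- child()
d <- bootstrap(d, 10, imputation = TRUE)

# retrieve a single bootstrap sample; use.imputed.data selects
# the kNN-imputed version of the sample instead of the raw one
raw.sample <- boot(d, 1)
imp.sample <- boot(d, 1, use.imputed.data = TRUE)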

The generation of bootstrap samples, at least as implemented in bnstruct, is an operation done on a dataset, and in principle it is not related to the learning algorithm used. However, for computational reasons, SEM does not operate on bootstrap samples. In case of complete data in the original dataset, SEM will fall back to MMHC, which will make use of the bootstrap samples:

> d <- asia()
> d <- bootstrap(d, 100)
... bnstruct :: Generating bootstrap samples ...
... bnstruct :: Bootstrap samples generated.
> n <- learn.network(d, bootstrap=T)
... bnstruct :: learning the structure using MMHC ...
... bnstruct :: learning using MMHC completed.
> wpdag(n)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0   94   87    0    6    0    8    3
[2,]   86    0    0   11    8  100   13    7
[3,]   96    0    0    0   16    0    4   86
[4,]    1    9    2    0   70    0    0    5
[5,]    7    3   22   68    0   99   12    3
[6,]    0    3    0    0    4    0    0    0
[7,]    9    1    3    0    1    0    0   87
[8,]    2    2   20   11    3    0   16    0

If you try to use SEM, you will get the same results:

> n <- learn.network(d, algo="sem", bootstrap=T)
... bnstruct :: learning the structure using SEM ...
... ... bnstruct :: no missing values found, learning the network once
... ... bnstruct :: learning the structure using MMHC ...
... ... bnstruct :: learning using MMHC completed.
... bnstruct :: learning using SEM completed.
> wpdag(n)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0   94   87    0    6    0    8    3
[2,]   86    0    0   11    8  100   13    7
[3,]   96    0    0    0   16    0    4   86
[4,]    1    9    2    0   70    0    0    5
[5,]    7    3   22   68    0   99   12    3
[6,]    0    3    0    0    4    0    0    0
[7,]    9    1    3    0    1    0    0   87
[8,]    2    2   20   11    3    0   16    0

Using the child dataset, which has missing values (I use 10 samples in this example):

> d <- child()
> d <- bootstrap(d, 10, imputation=TRUE)
... bnstruct :: Generating bootstrap samples with imputation ...
... bnstruct :: Bootstrap samples generated.
> n <- learn.network(d, bootstrap=T, use.imputed.data=T)
... bnstruct :: learning the structure using MMHC ...
... bnstruct :: learning using MMHC completed.
> wpdag(n)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
 [1,]    0    7    0    0    0    0    0    0    0     0     0     0     0
 [2,]   10    0   10    9    7   10    8   10   10     3     0     0     0
 [3,]    0    6    0    0    0    0    0    0    6     0     0     0     0
 [4,]    0    7    0    0    0    0    0    0    0     0     0     0     0
 [5,]    0   10    0    0    0    0    0    0    0    10     0     0     0
 [6,]    0    7    0    0    0    0    0    0    0     7    10     0     0
 [7,]    0    6    0    0    0    0    0    0    0     0    10    10     8
 [8,]    0    7    0    0    0    0    2    0    0     0     0     0    10
 [9,]    0   10   10    0    0    0    0    0    0     0     0     0     0
[10,]    0    3    0    0    1    0    0    0    0     0     0     0     0
[11,]    0    0    0    0    0    0    0    0    0     0     0     0     0
[12,]    0    0    0    0    0    0    6    0    0     0     0     0     0
[13,]    0    0    0    0    0    0    2    0    0     0     0     0     0
[14,]    0    0    0    0    0    0    1    0    1     0     0     0     0
[15,]    0    2    0    8    0    0    0    0    0     0     0     0     0
[16,]    0    0    0    0    0    0    0    0    0     0     0     0     0
[17,]    0    0    0    0    0    0    0    0    0     0     1     0     0
[18,]    0    0    0    0    0    0    1    0    0     0     0     7     0
[19,]    0    1    0    0    0    0    0    1    0     0     0     0     2
[20,]    0    0    0    0    0    0    0    0    0     0     0     0     0
      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]     0     0     0     0     0     0     0
 [2,]     0     2     0     0     0     1     0
 [3,]     0     0     0     0     0     0     0
 [4,]     0     9     0     0     0     0     0
 [5,]     0     0     0     0     0     0     0
 [6,]     0     0     0     0     0     0     0
 [7,]     9     0     0     0     1     0     0
 [8,]     0     0     0     0     0     2     0
 [9,]    10     0     0     0     0     0     0
[10,]     0     0    10     0     0     0     0
[11,]     0     0    10    10     0     0     0
[12,]     0     0     0     0    10     0     0
[13,]     0     0     0     0     0     8     0
[14,]     0     0     0     0     1     1    10
[15,]     0     0     0     0     0     0     0
[16,]     0     0     0     0     0     0     0
[17,]     0     0     0     0     0     0     0
[18,]     0     0     0     0     0     0     0
[19,]     0     0     0     0     0     0     0
[20,]     1     0     0     0     0     0     0

Cheers.

Cby19961020 commented 3 years ago

Hi there,

Thank you very much for your kind elaboration! I find the example/explanation very helpful! I know this is a silly question to ask, but you mentioned that SEM is not used on the bootstrap samples due to computation time, and that the MMHC algorithm is used instead. I do want to know how slow SEM can really get.

I am working with 8 variables and 1000 entries (with ~10 percent missing values), and when I apply net <- learn.dynamic.network(mydata, algo = "sem", scoring.func = "BIC", num.time.steps = 1), the algorithm can spit out a diagram for me within a minute. If we forced it to learn using SEM instead of MMHC, how long would you anticipate it to take? And is there a way to force it to learn using SEM?

I apologize for all the silly follow-up questions; please forgive my lack of background knowledge in this area. Thank you again, and please kindly let me know.

Regards, Bo

albertofranzin commented 3 years ago

It's not a silly question at all, but it's impossible to give a general answer, because it depends on many factors, such as the number of variables, the number of observations, the structure of the network (at least the reconstructed one), the proportion of missing data, the type of missingness (missing completely at random, missing at random, missing not at random), ...

In your case, I guess we can estimate that the time a bootstrapped SEM would take will be roughly (# samples) minutes. There is no way of forcing bootstrap with SEM in the package, but you can do it outside of it. If you take a look at the code of learn.structure() in R/learn-methods.R, you'll see that learning with bootstrap learns a new network for each sample and composes the DAGs obtained. If you want to try, you can do the same for SEM outside the package, as in the sketch below.
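A rough sketch of the idea (my own illustration, not code from the package: it assumes the in-memory BNDataset constructor and the boot(), num.boots(), num.variables(), variables() and node.sizes() accessors, plus all-discrete variables, so adapt it to your data):

library(bnstruct)

d <- child()                     # dataset with missing values
d <- bootstrap(d, 10)

p <- num.variables(d)
counts <- matrix(0, p, p)        # accumulator, playing the role of the wpdag

for (i in 1:num.boots(d)) {
  # re-wrap the i-th raw bootstrap sample into a BNDataset
  bd <- BNDataset(data = boot(d, i),
                  discreteness = rep('d', p),   # all-discrete assumption
                  variables = variables(d),
                  node.sizes = node.sizes(d))
  # learn one network per sample with SEM, and compose the learned dags
  bn <- learn.network(bd, algo = "sem", scoring.func = "BIC")
  counts <- counts + dag(bn)
}

counts                           # higher count = higher confidence in the edge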

Cheers.