sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0
17 stars 11 forks source link

Error in reading the data from file #29

Closed BKAmos closed 1 year ago

BKAmos commented 1 year ago

Hello,

I'm trying out this package and can't seem to read data from files, what am I missing? I've provided a short example below.

library("bnstruct")
dataset.from.file <- BNDataset("genusDataOnlyShort.txt", "GenusHeadersShort.txt", num.time.steps = 3, starts.from = 1)
genusDataOnlyShort.txt
1   1   1   1   1   1   2   2   2
1   1   1   1   1   1   2   2   2
1   1   2   1   1   1   1   2   2
1   1   2   1   1   1   2   1   2
GenusHeadersShort.txt
'Achromobacter_t0'  'Acidovorax_t0' 'Bosea_t0'  'Achromobacter_t1'  'Acidovorax_t1' 'Bosea_t1'  'Achromobacter_t2'  'Acidovorax_t2' 'Bosea_t2'
2   2   2   2   2   2   2   2   2
d   d   d   d   d   d   d   d   d

And I get the below error.

Error in read.dataset(dataset, data, discreteness, ...) : 
  Incoherent number of variable names in the dataset header (mismatch between header and data).

Any thoughts? What am I missing?

albertofranzin commented 1 year ago

Hi,

apparently the method is less flexible than expected, sorry for that.

It works by replacing tabs with spaces and the ticks with quotes (' with "). In this case I get

$ cat GenusHeadersShort.txt 
"Achromobacter_t0" "Acidovorax_t0" "Bosea_t0" "Achromobacter_t1" "Acidovorax_t1" "Bosea_t1" "Achromobacter_t2" "Acidovorax_t2" "Bosea_t2"
2 2 2 2 2 2 2 2 2
d d d d d d d d d
$ cat genusDataOnlyShort.txt 
1 1 1 1 1 1 2 2 2
1 1 1 1 1 1 2 2 2
1 1 2 1 1 1 1 2 2
1 1 2 1 1 1 2 1 2

> dataset.from.file <- BNDataset("genusDataOnlyShort.txt", "GenusHeadersShort.txt", num.time.steps = 3, starts.from = 1)
Messaggi di avvertimento:
1: In validityMethod(object) :
   Not all of the possible values have been observed for variables  1 2 4 5 6 9
2: In validityMethod(object) :
   Not all of the possible values have been observed for variables  1 2 4 5 6 9
3: In validityMethod(object) :
   Not all of the possible values have been observed for variables  1 2 4 5 6 9
4: In validityMethod(object) :
   Not all of the possible values have been observed for variables  1 2 4 5 6 9
5: In validityMethod(object) :
   Not all of the possible values have been observed for variables  1 2 4 5 6 9
6: In validityMethod(object) :
   Not all of the possible values have been observed for variables  1 2 4 5 6 9
> dataset.from.file

Dataset: 

num.variables 9

variables
Achromobacter_t0 Acidovorax_t0 Bosea_t0 Achromobacter_t1 Acidovorax_t1 Bosea_t1 Achromobacter_t2 Acidovorax_t2 Bosea_t2
discreteness
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
node.sizes
2 2 2 2 2 2 2 2 2
num.items
4
imputation
FALSE
has.boots
FALSE
has.imputed.boots
FALSE
num.boots
0

Hope this helps, let me know if you still get issues.

BKAmos commented 1 year ago

What you have mentioned fixed the problem, so thanks for that.

What do the warnings mean? The Messaggi di avvertimento? Is it common in DBN to require all states to be present within all nodes? I've used pgmpy in the past and it has a very similar condition.

The reason I ask is that I updated the data to include all possible states in each sample.

genusDataOnlyShort.txt:

2 1 1 1 1 1 2 2 2
1 2 1 2 1 1 2 2 1
1 1 2 1 2 1 1 2 2
1 1 2 1 1 2 2 1 2

GenusHeadersShort.txt:

"Achromobacter_t0" "Acidovorax_t0" "Bosea_t0" "Achromobacter_t1" "Acidovorax_t1" "Bosea_t1" "Achromobacter_t2" "Acidovorax_t2" "Bosea_t2"
2 2 2 2 2 2 2 2 2
d d d d d d d d d

And when I run the following code:

library(bnstruct)

dataset.from.file <- BNDataset('genusDataOnlyShort.txt', 'GenusHeadersShort.txt', num.time.steps=3, starts.from=1)

dbn <- learn.dynamic.network(dataset.from.file, num.time.steps=3)
obs <- list("observed.vars" = c(1,4), "observed.vals" = c(1,1))

engine <- InferenceEngine(dbn, obs)
engine <- belief.propagation(engine)
new.net <- updated.bn(engine)

results <- em(engine, dataset.from.file)
updated.engine <- results$InferenceEngine
updated.dataset <- results$BNDataset

write_xgmml(dbn, path="RTesting/graphTEst")
plot(dbn)

I get the following output in the console:

> library(bnstruct)
> dataset.from.file <- BNDataset('/genusDataOnlyShort.txt', 'GenusHeadersShort.txt', num.time.steps=3, starts.from=1)
> dbn <- learn.dynamic.network(dataset.from.file, num.time.steps=3)
... ... bnstruct :: learning the structure using MMHC ...
... ... bnstruct :: learning using MMHC completed.
... ... bnstruct :: learning network parameters ... 
... ... bnstruct :: parameter learning done.
> obs <- list("observed.vars" = c(1,4), "observed.vals" = c(1,1))
> engine <- InferenceEngine(dbn, obs)
Warning message:
In get.adjacency(ig.mst, type = "both", attr = NULL, edges = FALSE) :
  The `edges` argument of `as_adjacency_matrix` is deprecated; it will be removed in igraph 1.4.0
> engine <- belief.propagation(engine)
> new.net <- updated.bn(engine)
> results <- em(engine, dataset.from.file)
... ... bnstruct :: starting EM algorithm ...
... ... ... bnstruct :: learning network parameters ... 
... ... ... bnstruct :: parameter learning done.
... ... bnstruct :: EM algorithm completed.
> updated.engine <- results$InferenceEngine
> updated.dataset <- results$BNDataset
> write_xgmml(dbn, path = "/RTesting/graphTEst")
Error in cat(x, file = file, sep = c(rep.int(sep, ncolumns - 1), "\n"),  : 
  argument 1 (type 'list') cannot be handled by 'cat'
> plot(dbn)
Error in plot.BN(dbn) : this function requires the graph package.

So, I can't write the network to file or plot the network. That's one error that I'm seeing. So, the plot error is me needing to figure out how to install the graph package from R as it's currently deprecated, from what I gathered. The second error is writing the dbn variable to xgmml for cytoscape visualization. So, again, am I missing something in writing the graph to xgmml file?

Also, just as a heads up, and for my own knowledge sakes, if I don't have every state in every node I can't do the inference steps. It throws errors. Is this common? Am I missing some basic understanding of how DBNs work and how every state is needed in every node?

Another issue that I saw when importing the data and setting the num.time.steps = 3. When I check the num.time.steps variable it says that it's equal to 1, rather than 3. Have I done something wrong here as well?

A couple of questions here.

albertofranzin commented 1 year ago

Hi,

the graph package has to be installed from Bioconductor https://www.bioconductor.org/packages/release/bioc/html/graph.html

The deprecated version is the one on CRAN. Using the version from Bioconductor I can plot the network without errors. You can use the package without it for many other things, but for plotting this check is the first thing in the method:

if (!requireNamespace("graph", quietly=T))
              stop("this function requires the graph package.")

I don't get the warning about the igraph package, maybe because I have an older version of it. I'll check. But I suspect that this is the reason for the error in the write_xgmml call (that I also get), since it internally uses igraph and the code that gives the error is not something I wrote. Honestly, I haven't checked this function in years, so it is definitely possible that I have not addressed past changes in igraph.

The num.time.steps value should be 3. I'll check this one too. However, as this package is not my main activity anymore, I'll see when I can make some time for it.

Aside from the implementation and package dependencies issues, the inference works. If I print dbn and new.new I see that the conditional probability tables have been updated. The warnings about not seeing all the possible values essentially tell you that the results that you get may not be reliable, since assumptions have to be made about the probability distributions of the observations, and it cannot be guaranteed they actually hold.

BKAmos commented 1 year ago

Hello!

I've got the graph to work. Thanks for pointing that out.

I still can not get the graph to write to xgmml but I can take the DAG matrix and just modify it and input it into Cytoscape, at least I think I can : ).

I still can not get the num.time.steps to equal anything but 1.

I do notice that the probabilities change and I've also noticed that the most.probable.values change from the different representations of the graph after either expectation maximization or belief propagation. So, that's interesting.

A question, what are your thoughts on the most effective way to compare results after either observation or intervention? Would you just compare the most probable value vectors before and after intervention?

Thanks for taking the time to look into some of these things.

Cheerfully, Kirtley

albertofranzin commented 1 year ago

Yes, I think comparing the most probable values is the simplest way to observe the effect of your observations/interventions. The probability tables are not the easiest thing to look at, and in the end what it matters is to observe what changes in the actual values you can get. Sampling new data and comparing the distributions of the values is also a good way of doing that I think.

The num.time.steps value is clearly a bug, I'll see when I can fix it, but in the meantime you can cross-check that at least the order is respected, so no arcs go against your specifications.