sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Discretization in breaks fails #4

Closed ktroule closed 5 years ago

ktroule commented 6 years ago

Hi, I'm playing with bnstruct, and I'm facing a problem that I haven't been able to solve.

I'm using a matrix with continuous data, that has values from 0 to ~1.9. The problem is related to the discretization of the values.

BN <- BNDataset(data=t(scores.tumor.brca), discreteness = rep('c', 67), 
                variables = row.names(scores.tumor.brca), node.sizes = rep(2,67), starts.from=0)

net <- learn.network(BN)

Output

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... bnstruct :: learning the structure using MMHC ...
Error in cut.default(data[, i], quantiles, labels = FALSE, include.lowest = TRUE) : 
  'breaks' are not unique

I've checked whether the function is actually able to discretize into the 2 classes, and all 67 variables can be discretized into two classes.

for(i in 1:67){
  q <- cut.default(t(scores.tumor.brca)[,i], 2, labels = FALSE, include.lowest = TRUE)
  print(unique(q))
  }

The output for all 67 variables is always 1 2.

What am I missing here?

Thanks for your time.

albertofranzin commented 6 years ago

Hi Kevin,

I am not able to reproduce the error with some example data.

a <- matrix(rep(0:19/10,4), nrow=4)
d <- BNDataset(data=t(a), variables=c("a", "b", "c", "d"), discreteness = rep('c', 4), node.sizes = rep(2,4), starts.from=0)
net <- learn.network(d)

The learning is successful. The for loop with the cut.default works as in your case.

Do you have the same error without the starts.from parameter?

Because the results can differ:

> for(i in 1:4){ q <- cut.default(t(a)[,i], 2, labels = FALSE, include.lowest = TRUE);   print(q) }
 [1] 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2
 [1] 1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 1 1 2 2 2
 [1] 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2
 [1] 1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 1 1 2 2 2
> for(i in 1:4){ q <- cut.default(t(a+1)[,i], 2, labels = FALSE, include.lowest = TRUE);   print(q) }
 [1] 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2
 [1] 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2
 [1] 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2
 [1] 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2

The second case is equivalent to using starts.from. There are some differences in the output, so it might be that the distribution of your data suffers from the transformation (e.g. a variable having values only below 1).

ktroule commented 6 years ago

Yes, indeed, several of my variables have values lower than 1. I've added a 1 to each variable and removed starts.from, and I still get the same error. Just in case: my variables do not follow a normal distribution, and the dataset is quite sparse, about 17% of the values are 0.

albertofranzin commented 6 years ago

Sorry, maybe I phrased it unclearly :)

Adding 1 to each variable is equivalent to using starts.from=0, which is intended for already discrete data. The for loop you tried in your first post applied cut.default to the original data, while the learning was done on the transformed data, so they were effectively two different datasets. Hence, it is possible that in one case (without the additional 1 or starts.from) you get 1 2 for every variable, while in the other one (the one you were using to learn the network) you get an error. So you should first try learning without starts.from and without the additional 1.

BN <- BNDataset(data=t(scores.tumor.brca), discreteness = rep('c', 67), 
                variables = row.names(scores.tumor.brca), node.sizes = rep(2,67))

net <- learn.network(BN)

If it fails again, you can try to discretize the data manually and learn from these now discrete distributions, e.g.

newdata <- scores.tumor.brca
cutpoint <- 1
newdata[newdata <= cutpoint] <- 1
newdata[newdata > cutpoint] <- 2
BN <- BNDataset(data=t(newdata), discreteness = rep('d', 67), variables = row.names(newdata), node.sizes = rep(2,67))

net <- learn.network(BN)

or for any suitable value for cutpoint depending on your data (indeed, the distribution of the values is the most likely issue here).

ktroule commented 6 years ago

Sorry, I explained myself poorly. I did try not adding the 1 and keeping starts.from at its default, but it keeps failing.

Regarding the discretization, I'm not sure: since the values do not follow a normal distribution, I don't know how discretizing them without taking this into account can affect the final network.

Thanks

ktroule commented 6 years ago

Another question.

Regarding the section where you discuss Dynamic Bayesian Networks:

The layering option is used for the variables (nodes), but is there any way to tell the function which samples belong to which time? For instance, having samples at time 0, and samples at time 1.

Thanks

albertofranzin commented 6 years ago

If you mean a situation like

V1  V2  V3
 1   1   2   (time 1)
 2   2   2   (time 2)
 2   2   1   (time 1)
 2   3   3   (time 2)

where you have a layering e.g. 1 1 2 for the first instant, you have to put the observations together, by concatenating the different instants:

V1t1  V2t1  V3t1   V1t2  V2t2  V3t2 
   1     1     2      2     2     2   
   2     2     1      2     3     3 

and tell bnstruct that you have 2 instants using num.time.steps=2. This way the layering will be propagated throughout the other instants and become 1 1 2 3 3 4.
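In R, the concatenation above is just a cbind() of the per-instant matrices, with rows aligned by subject. A minimal sketch with made-up data (the matrix names here are hypothetical, not from the thread):

```r
# Two hypothetical instants of the same three variables, one row per subject.
d.t1 <- matrix(c(1, 1, 2,
                 2, 2, 1), nrow = 2, byrow = TRUE)  # time 1
d.t2 <- matrix(c(2, 2, 2,
                 2, 3, 3), nrow = 2, byrow = TRUE)  # time 2

# One row per subject, instants side by side: V1t1 V2t1 V3t1 V1t2 V2t2 V3t2
wide <- cbind(d.t1, d.t2)
print(dim(wide))  # 2 rows, 6 columns

# This wide matrix is what would be passed to BNDataset(...) together with
# num.time.steps = 2, so the layering is propagated across the instants.
```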

If you have instead a situation like

V1  V2  V3
not observed   (time 1)
 2   2   2   (time 2)
 2   2   1   (time 1)
not observed (time 2)

since bnstruct assumes coherence between observations, you'll have to resort to missing data, e.g.

V1t1  V2t1  V3t1   V1t2  V2t2  V3t2 
   ?     ?     ?      2     2     2   
   2     2     1      ?     ?     ? 

ktroule commented 6 years ago

Thanks for the fast response.

Two silly questions.

First, can the header of the matrix be anything I want, or must it be VarName + t0, t1 ... tn? Second, is it mandatory to tell the function the layering? In my case I have no information about which nodes should be parents and children (which, as far as I've read in the manual, is what you do with layering).

Once more, thanks.

In the meantime

bnset.brca <- BNDataset(scores.mix.discret,  discreteness = rep('d', 68), variables = colnames(scores.mix.discret), 
                        node.sizes = rep(3,68), starts.from=1, num.time.steps=2)

bnset.brca.boots <- bootstrap(bnset.brca, 100)

bnset.brca.boots.learn <- learn.dynamic.network(bnset.brca, bootstrap = TRUE, scoring.func = 'BDeu')

I get this when trying to run the last line:

Error in .local(x, ...) : 
  promise already under evaluation: recursive default argument reference or earlier problems?

albertofranzin commented 6 years ago

1) That was just a way to make the example clearer. If you want to give a name to all the variables in all the instants, you can name them as you want. The easier way, however, is to specify the variable names only for the "real" variables (i.e. for one single instant; after all, in the other instants the variables are the same, just observed at a different time), and BNDataset() will add a _t1 for the first instant, a _t2 for the second one, and so on. (I realize now that in the previous example I made time start at 0; it should instead start at 1, as per R indexing. I'll fix it above.)

2) No, the layering can be omitted. In this case, the layering considered will be the natural one given by the time sequence (a node cannot have a parent in the future).

3) I have honestly never seen that. You have to use

bnset.brca.boots.learn <- learn.dynamic.network(bnset.brca.boots, bootstrap = TRUE, scoring.func = 'BDeu')

since in R, when you apply a method, you still have to assign the output to something; as written, bnset.brca does not contain any bootstrap samples. You should get an error message from bnstruct anyway... I'll take a look when I can.

adw96 commented 6 years ago

Hi all,

I encountered the same issue. The problem can be fixed by modifying line 114 of utils.R from

quantiles <- quantile( data[,i], probs = (0:levels[i])/levels[i], na.rm = TRUE )

to

quantiles <- unique(quantile( data[,i], probs = (0:levels[i])/levels[i], na.rm = TRUE ))

This won't change any behaviour in datasets where the algorithm runs, but it will prevent this error in future.
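For anyone wondering why the breaks collapse in the first place: with a zero-inflated variable, two of the requested quantiles can coincide, and cut() refuses duplicated breaks. A minimal sketch with made-up data (not from this dataset), using node.sizes = 2 so levels = 2:

```r
# Hypothetical zero-inflated variable: more than half of the values are 0,
# so the 0% and 50% quantiles coincide.
x <- c(rep(0, 12), 0.3, 0.7, 1.1, 1.5, 1.9)
quantiles <- quantile(x, probs = (0:2)/2, na.rm = TRUE)
print(quantiles)  # the 0% and 50% quantiles are both 0

# cut.default(x, quantiles, ...) would stop with "'breaks' are not unique";
# deduplicating the breaks, as in the patch, avoids the error:
q <- cut.default(x, unique(quantiles), labels = FALSE, include.lowest = TRUE)
print(table(q))
```

Note that after deduplication the variable ends up with fewer bins than requested (here a single bin), which is why the patch only makes sense as a fallback for degenerate distributions.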

I've submitted a pull request implementing only this change.

Cheers,

Amy

albertofranzin commented 6 years ago

Hi Amy,

thanks a lot! I merged the change.

Alberto

bogoni1 commented 5 years ago

Dear Amy,

I'm experiencing the same problem. How should I modify line 114? Within the "learn.network" function, or in the "bnstruct" package sources? Is a script available for this purpose?

Thanks a lot.

Best,

MariusGheorghe commented 5 years ago

Apparently Alberto updated the CRAN package with Amy's fix last Friday (31.08.2018). If you installed the package prior to that date, reinstall it, or install it from the GitHub repository via the devtools package in R. That fixed my problem.