sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Error for "sem" algorithm #12

Closed annennenne closed 5 years ago

annennenne commented 5 years ago

I am having problems with the sem option for the learn.network() function. It produces a fatal error:

Error in while ((difference > threshold && no.iterations <= max.em.iterations) ||  : 
  missing value where TRUE/FALSE needed

and a lot (>= 50) of repetitions of this warning:

In c(cpt1) * c(cpt2) :
  longer object length is not a multiple of shorter object length

This happens both on data with and without missing information. I have provided a minimal example that produces the error below. Am I specifying something wrongly or is there a bug? Thanks in advance!

library(bnstruct)

# simulate data
n <- 100
set.seed(123)

# data without missing information
edata <- data.frame(Z = rnorm(n, mean = 10))
edata$X1 <- 0.5 * edata$Z + rnorm(n, mean = 0)
edata$X3 <- rnorm(n, mean = 15)
edata$X2 <- edata$X3 + rnorm(n, mean = 5)
edata$Y <- edata$X1 + edata$X2 + edata$X3 - edata$Z + rnorm(n, mean = 10)

# data with missing information
edata_wm <- edata
set.seed(1234)
edata_wm$X1[sample(1:n, 10)] <- NA
edata_wm$X2[sample(1:n, 5)] <- NA
edata_wm$X3[sample(1:n, 20)] <- NA

#minimal example of sem error (without missing information)
bn_edata <- BNDataset(edata, discreteness = rep(FALSE, 5),
                      variables = names(edata),
                      node.sizes = rep(6,5))

net_sem_edata <- learn.network(bn_edata, algo = "sem")

#minimal example of sem error (with missing information)
bn_edata_wm <- BNDataset(edata_wm, discreteness = rep(FALSE, 5),
                         variables = names(edata_wm),
                         node.sizes = rep(6,5))

net_sem_edata_wm <- learn.network(bn_edata_wm, algo = "sem")
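
For reference, both messages can be reproduced in base R without bnstruct. This is only a minimal sketch of what the messages mean; the vectors `cpt1` and `cpt2` and the values of `difference` and `threshold` are made up for illustration, and the snippet does not show where the problem arises inside the package.

```r
# Base R only. cpt1 and cpt2 are made-up vectors, not bnstruct objects.
cpt1 <- c(0.2, 0.3, 0.5)
cpt2 <- c(0.4, 0.6)

# Multiplying vectors whose lengths are not multiples of each other recycles
# the shorter one and raises the "longer object length" warning.
recycle_warning <- tryCatch(
  { c(cpt1) * c(cpt2); NA_character_ },
  warning = function(w) conditionMessage(w)
)

# A loop condition that evaluates to NA makes while() stop with an error.
difference <- NA_real_   # assumed value, for illustration only
threshold <- 0.001       # assumed value, for illustration only
loop_error <- tryCatch(
  { while (difference > threshold) break; NA_character_ },
  error = function(e) conditionMessage(e)
)

print(recycle_warning)
print(loop_error)
```

The captured strings should match the warning and the error quoted above, which suggests that some intermediate quantity becomes `NA` before the loop condition is checked.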

albertofranzin commented 5 years ago

Regarding the complete-data case of the MWE (btw, thanks for providing one!): (S)EM is not supposed to be used with complete data. I'll add some checks.
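
Until such checks exist, a guard on the caller's side could look roughly like the sketch below. `choose_algo()` is a hypothetical helper, not part of bnstruct, and the small data frames are made up; the only assumption about the package is that `"mmhc"` and `"sem"` are valid values for the `algo` argument of `learn.network()`.

```r
# choose_algo() is a hypothetical helper, not a bnstruct function.
# It picks "sem" only when the raw data actually contain missing values.
choose_algo <- function(df) {
  if (any(is.na(df))) "sem" else "mmhc"
}

complete_df <- data.frame(a = c(1, 2, 3), b = c(2, 1, 3))
missing_df  <- data.frame(a = c(1, NA, 3), b = c(2, 1, 3))

print(choose_algo(complete_df))
print(choose_algo(missing_df))

# Possible usage with the datasets from this thread (requires bnstruct):
# net <- learn.network(bn_edata, algo = choose_algo(edata))
```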

As for the missing values case: there is indeed a bug, related to how continuous variables are treated. I will try to fix it in the coming weeks, as soon as I have the chance, but I can't make promises on when it will actually be ready.

In the meantime, one workaround that might work is to discretize the dataset in advance; sorry for that.
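
A minimal sketch of that workaround, assuming equal-width bins via base R's `cut()`: `discretize_equal_width()` is a hypothetical helper rather than a bnstruct function, and the data are simulated for illustration. Missing values stay `NA`, so the discretization can be applied to data that already contain them.

```r
# discretize_equal_width() is a hypothetical helper, not a bnstruct function.
# Each column is cut into `bins` equal-width intervals and stored as integer
# codes 1..bins; missing values remain NA.
discretize_equal_width <- function(df, bins = 6) {
  as.data.frame(lapply(df, function(x) as.integer(cut(x, breaks = bins))))
}

set.seed(1)
raw <- data.frame(u = rnorm(50), v = rnorm(50, mean = 5))
raw$u[c(3, 10)] <- NA  # inject two missing values

disc <- discretize_equal_width(raw)
print(range(disc$u, na.rm = TRUE))  # smallest and largest bin codes in use
print(which(is.na(disc$u)))         # positions of the preserved NAs
```

The result can then be passed to `BNDataset()` with `discreteness = rep(TRUE, ncol(disc))` and `node.sizes` set to the number of bins.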

annennenne commented 5 years ago

Thanks for the quick reply!

I had assumed it would simply use MMHC when there is no missing information, and I think that would be a nice default option.

I tried discretizing the data (minimal example below), but now I get a new error message:

Error in cliques[[parents.list[clique]]] : 
  attempt to select less than one element in get1index

Here's my example code:

#make discretized data
n <- 100
set.seed(123)
edata <- data.frame(Z = rnorm(n, mean = 10)) 
edata$X1 <- 0.5 * edata$Z + rnorm(n, mean = 0)
edata$X3 <- rnorm(n, mean = 15)
edata$X2 <- edata$X3 + rnorm(n, mean = 5) 
edata$Y <- edata$X1 + edata$X2 + edata$X3 - edata$Z + rnorm(n, mean = 10)

edata_d <- as.data.frame(sapply(edata, function(x) as.numeric(cut(x, breaks = 6))))

edata_d_wm <- edata_d
set.seed(1234)
edata_d_wm$X1[sample(1:n, 10)] <- NA
edata_d_wm$X2[sample(1:n, 5)] <- NA
edata_d_wm$X3[sample(1:n, 20)] <- NA

#example of sem error (with missing information, only discrete variables)
bn_edata_d_wm <- BNDataset(edata_d_wm, discreteness = rep(TRUE, 5),
                         variables = names(edata_d_wm),
                         node.sizes = rep(6,5))

net_sem_edata_d_wm <- learn.network(bn_edata_d_wm, algo = "sem")

albertofranzin commented 5 years ago

Yes, with complete data it should amount to a single round of MMHC. The problem is that it currently continues with the rest of the method, and that is the part that fails.

The new error seems quite serious, and I don't yet know what causes it. I will investigate.

annennenne commented 5 years ago

Thanks!

albertofranzin commented 5 years ago

Should be fixed now, fingers crossed. Thanks for finding this.