sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

BNStruct #7

Closed sarafrr closed 5 years ago

sarafrr commented 5 years ago

Hi, I would like to build a BNDataset from a dataset that contains both discrete and continuous variables. It is a cancer dataset: the first variable is 'age', the second is menopausal status (1,2,3), followed by several hormone measurements, the histologic grade (1,2,3,4,5), the diameter, and the number of positive lymph nodes (1,2,5,10,massive). I think I may be making a mistake by using such different types of data to learn a Bayesian network.

In any case, I would be grateful for any suggestions. Thanks, Regards, Sara

sarafrr commented 5 years ago

Columns 2, 7 and 10 of the dataset hold the discrete variables (menopausal status, histologic grade and positive lymph nodes). In the code below I moved the discrete variables to the last 3 columns of the dataset and set 'discreteness' to 7 times 'c' followed by 3 times 'd'. I set the node sizes (node.sizes) to 2 for the first 7 variables and to the number of levels for the discrete ones: for example, the variable 'Lpos' has 38 values (1,2,3,...37,massive).

If I use only the continuous variables with a node size of 2 for each, bayesian.network() works, but not with other node sizes. From other posts, my understanding is that with a different node size at least one quantization bin of some variable ends up with no samples, which produces the same error as the one reported below.

The BNDataset is created without errors, but when I try to learn the network I get this error:

.. ... ... ... ... ... bnstruct :: learning the structure using MMHC ...
Error in cut.default(data[, i], quantiles, labels = FALSE, include.lowest = TRUE) : 'breaks' are not unique

The code follows:

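library(bnstruct)

# 7 continuous variables ('c') followed by 3 discrete ones ('d')
p1 <- rep('c', 7)
p2 <- rep('d', 3)
p <- c(p1, p2)
# the discrete variables (columns 2, 7 and 10 of cancerData) go last
cancerDataTOT <- cbind(cancerData_[, c(1:7)], cancerData[, c(2, 7, 10)])
matrixCancerDataTOT <- as.matrix(cancerDataTOT)
BNDataTOT <- BNDataset(matrixCancerDataTOT, discreteness = p,
                       variables = c("Anni", "ER", "PR", "Ps2", "Cath", "G", "Diam", "Men", "Ist", "Lpos"),
                       node.sizes = c(rep(2, 7), 3, 10, 38))
# structure learning; this is the call that raises the error
net <- learn.network(BNDataTOT)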

QUESTION 1: Why do I get the "'breaks' are not unique" error when using all the variables (continuous and discrete)?
QUESTION 2: Why does the error appear when I increase the node sizes in the continuous-only case? Is the explanation I found correct?

albertofranzin commented 5 years ago

Hi Sara,

That error should have been fixed in the latest commit. Are you using the latest version? (The CRAN version is not yet on par with the GitHub one.)

However, it depends on the distribution of your data. If one value is repeated many times (much more often than the others), it is likely to be selected twice as a breakpoint (the quantiles passed as breaks to the cut function). The latest commit adds a call to unique that prevents this from happening, but the side effect is that, for the variables where this happens, you will see fewer (discretized) values than you expect.
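
The mechanism described above can be reproduced with base R alone. This is only a sketch, not the bnstruct source: the vector `x` and the choice of four bins are made up for illustration. When one value dominates, several quantiles coincide, `cut` rejects the duplicated breaks, and removing duplicates with `unique` leaves fewer bins than requested.

```r
# Hypothetical variable in which the value 0 dominates the sample.
x <- c(rep(0, 8), 1, 2, 3, 4)
n_bins <- 4

# Equal-frequency breakpoints: the 0%, 25%, 50%, 75% and 100% quantiles.
breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)
print(breaks)

# cut() refuses duplicated breakpoints; capture the error message instead of stopping.
res <- tryCatch(
  cut(x, breaks, labels = FALSE, include.lowest = TRUE),
  error = function(e) conditionMessage(e)
)
print(res)

# Dropping the duplicates lets cut() run, but with fewer bins than requested.
ubreaks <- unique(breaks)
bins <- cut(x, ubreaks, labels = FALSE, include.lowest = TRUE)
n_obtained <- length(unique(bins))
print(n_obtained)
```

Here four bins were requested but fewer are obtained, which is the side effect mentioned above.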

I don't really have a solution for this, other than preprocessing the data so that the dataset's values are already discretized in a meaningful way.
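
One possible way to do such preprocessing, sketched in base R: choose the cut points by hand and recode the continuous column as integer levels before building the dataset. The name `diam`, its values and the cut points are invented for illustration; meaningful thresholds have to come from domain knowledge.

```r
# Hypothetical continuous measurements (e.g. a diameter).
diam <- c(3.5, 12, 18, 22, 22, 22, 35, 41, 60)

# Hand-picked, strictly increasing cut points (assumed values, for illustration only).
cut_points <- c(-Inf, 10, 20, 40, Inf)
n_levels <- length(cut_points) - 1

# Recode to integer levels 1..n_levels; right = TRUE gives intervals of the form (a, b].
diam_level <- cut(diam, breaks = cut_points, labels = FALSE, right = TRUE)

# Check that every level is populated before declaring n_levels states for this variable.
level_counts <- table(factor(diam_level, levels = seq_len(n_levels)))
print(level_counts)
```

The recoded column could then be declared as discrete ('d') with a node size equal to the number of levels, here 4, assuming the convention that discrete values are integers starting from 1.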