sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Well, a lot of queries! #6

Closed by deepunamb 5 years ago

deepunamb commented 5 years ago

Hi! I've been working with your package for quite some time now, and several questions have come up while using it. Please help me if possible.

  1. First, I found the tutorial-style reference for your package, which explains how to use it step by step. It is very well documented, but one part is unclear. In section 4.1.2, "Learning Dynamic Bayesian Networks" (pages 13-14), there is code showing how to use bnstruct to learn DBNs. The datasets used in that code are "evolving_system.data" and "evolving_system.header", but I could not find them in the downloaded package. Every other dataset used in the reference is in the package except this one. Please fix this: without seeing what that dataset looks like, nobody can tell how to format their own data for the package.

Consider a data set like this:

day hour pm2.5 DEWP TEMP PRES Iws
1 0 150 -21 -11 1021 1.79
1 1 150 -21 -12 1020 4.92
1 2 150 -21 -11 1019 6.71
1 3 150 -21 -14 1019 9.84
1 4 150 -20 -12 1018 12.97
1 5 150 -19 -10 1017 16.1
1 6 150 -19 -9 1017 19.23
1 7 150 -19 -9 1017 21.02
1 8 150 -19 -9 1017 24.15
1 9 150 -20 -8 1017 27.28
1 10 150 -19 -7 1017 31.3
1 11 150 -18 -5 1017 34.43
1 12 150 -19 -5 1015 37.56
1 13 150 -18 -3 1015 40.69
1 14 150 -18 -2 1014 43.82
1 15 150 -18 -1 1014 0.89
1 16 150 -19 -2 1015 1.79
1 17 150 -18 -3 1015 2.68
1 18 150 -18 -5 1016 1.79
1 19 150 -17 -4 1017 1.79
1 20 150 -17 -5 1017 0.89
1 21 150 -17 -5 1018 1.79
1 22 150 -17 -5 1018 2.68
1 23 150 -17 -5 1020 0.89

  2. This is time series data, and I want to learn the structure and parameters of a Dynamic Bayesian Network from it. Note that the variables lie in very different ranges: one has only negative values, another has very small values, and others have values in the 1000s. Can I use your package to learn a Dynamic Bayesian Network from this? (Ignore the variables "day" and "hour".)

  3. One more thing: I saw one of the other issues and the answers you posted there, where you explained how time series data should be laid out. But in your example there was more than one row of values for the variables at the different time points. In the data above there are 24 time steps, and I have only one row of values for each time point. So can I learn a dynamic network using only a single row of values for each variable at each point in time? For example, if I rearrange the above data as you suggested in a previous reply to someone, 1 0 150 -21 -11 1021 1.79 1 1 150 -21 -12 1020 4.92 and so on, will it work on this dataset?

  4. I tried to learn the network from the above data after removing the values of day and hour from each row. That left 5 values per time step, as described just above, and I tried to learn a dynamic Bayesian network with 5 time steps.

150 21 11 1021 1.79 150 21 12 1020 4.92 150 21 11 1019 6.71 150 21 14 1019 9.84 150 20 12 1018 12.97

But this error comes up:

bnstruct :: learning the structure using MMHC ...
Error in cut.default(data[, i], quantiles, labels = FALSE, include.lowest = TRUE) :
  'breaks' are not unique

I don't know what this means.

Thank you.

albertofranzin commented 5 years ago

Hi,

1) About the example dataset: those files do not exist. It was meant to be just a generic example, but I understand this causes confusion (you're not the first to point this out). I'll update the vignette as soon as I have the chance, but the format of the data is described anyway. In the meantime, you can check inst/extdata/asia_2_layers.{header,data}.

In the header file, you can provide the description for the variables of a single layer (as they are of course repeated through several layers); in the data file, each row is a full observation, where the different layers are "concatenated". So, suppose you have three binary variables V1, V2, V3, and you observe the system in 2 different instants, you'll have

V1 V2 V3
2 2 2
D D D

in the header file, and

0 1 1 0 0 1
0 0 1 0 0 1
1 0 0 1 0 0
...

in the data file, where each row has the values observed for, respectively,

V1_t1 V2_t1 V3_t1 V1_t2 V2_t2 V3_t2
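
If your data comes as one long series (one row per instant), one possible way to get rows in the concatenated layout is a sliding window over consecutive instants. The sketch below uses base R only; `concat_layers` and the toy `series` matrix are hypothetical names made up for illustration, not part of bnstruct. Note that overlapping windows are an assumption on my side: the resulting rows are not independent observations, so treat anything learned from them with care.

```r
# Hypothetical toy series: 3 binary variables observed at 4 consecutive instants,
# one row per instant.
series <- matrix(c(0, 1, 1,
                   0, 0, 1,
                   1, 0, 0,
                   1, 1, 0),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("V1", "V2", "V3")))

# concat_layers is a hypothetical helper (not a bnstruct function).
# It puts n_layers consecutive instants side by side, so each output row
# holds the values for t1, then t2, and so on.
concat_layers <- function(x, n_layers) {
  n_rows <- nrow(x) - n_layers + 1
  stopifnot(n_rows >= 1)
  blocks <- lapply(seq_len(n_layers), function(k) {
    block <- x[k:(k + n_rows - 1), , drop = FALSE]
    colnames(block) <- paste0(colnames(x), "_t", k)
    block
  })
  do.call(cbind, blocks)
}

wide <- concat_layers(series, n_layers = 2)
print(wide)
```

With 4 instants and 2 layers this gives 3 rows of 6 values each, with columns ordered V1_t1 V2_t1 V3_t1 V1_t2 V2_t2 V3_t2.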

2) The package expects the values of discrete variables to start from 1, and you can provide a different starting point using the starts.from parameter. For continuous variables it is handled automatically, but you have to provide a meaningful number of quantization levels. Please take a look at the documentation for read.dataset. That said, you should only provide the data you need, so you might need to preprocess your dataset to remove day and hour.
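
As a rough sketch of that preprocessing in base R: the code below drops day and hour and writes a header file and a data file in the three-line header layout shown in point 1. The file names, the choice of 3 quantization levels, and the idea that the second header line carries the number of levels for variables marked C on the third line are assumptions for illustration; please check them against the documentation for read.dataset.

```r
# First three rows of the data posted above, one row per hour.
raw <- data.frame(
  day   = c(1, 1, 1),
  hour  = c(0, 1, 2),
  pm2.5 = c(150, 150, 150),
  DEWP  = c(-21, -21, -21),
  TEMP  = c(-11, -12, -11),
  PRES  = c(1021, 1020, 1019),
  Iws   = c(1.79, 4.92, 6.71),
  check.names = FALSE
)

# Keep only the variables that should enter the network.
vars <- setdiff(names(raw), c("day", "hour"))
obs <- raw[, vars]

# Assumption: 3 quantization levels per variable, picked arbitrarily here.
n_levels <- 3
header <- c(paste(vars, collapse = " "),                         # variable names
            paste(rep(n_levels, length(vars)), collapse = " "),  # levels per variable
            paste(rep("C", length(vars)), collapse = " "))       # C = continuous

# Hypothetical file names, written to a temporary directory.
header_file <- file.path(tempdir(), "weather.header")
data_file <- file.path(tempdir(), "weather.data")
writeLines(header, header_file)
write.table(obs, data_file, row.names = FALSE, col.names = FALSE, quote = FALSE)
```

This only covers dropping the unused columns and writing a single layer; arranging several time steps per row is a separate step.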

3) I honestly have never tried to learn anything from a single observation; even if it works, I don't know how reliable the results will be.

4) This should have been fixed in the latest commit. If it happens again, you might be providing the data in the wrong format. Do you treat the variables as discrete or continuous, and how many quantization levels do you use?
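
For context on the message itself: it comes from base R's cut(), which refuses break points that contain duplicates. Judging from the call shown in the error, the breaks are quantiles of a data column, so they coincide whenever a column has too few distinct values, for example a constant column such as pm2.5 above, or any column when there is only a single row. The sketch below reproduces this with base R only; `has_tied_breaks` is a hypothetical helper for checking your own columns, and the 3 levels are an arbitrary assumption.

```r
n_levels <- 3
probs <- seq(0, 1, length.out = n_levels + 1)

# A constant column: every quantile is 150, so all break points are identical.
pm25 <- rep(150, 5)
breaks_const <- quantile(pm25, probs)
msg <- tryCatch({
  cut(pm25, breaks_const, labels = FALSE, include.lowest = TRUE)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# A column with enough distinct values quantizes without complaint.
iws <- c(1.79, 4.92, 6.71, 9.84, 12.97)
breaks_ok <- quantile(iws, probs)
levels_ok <- cut(iws, breaks_ok, labels = FALSE, include.lowest = TRUE)
print(levels_ok)

# Hypothetical helper: TRUE if quantile-based breaks for x would contain duplicates.
has_tied_breaks <- function(x, n_levels) {
  anyDuplicated(quantile(x, seq(0, 1, length.out = n_levels + 1))) > 0
}
print(has_tied_breaks(pm25, n_levels))
print(has_tied_breaks(iws, n_levels))
```

Running a check like this on each column before learning shows which variables cannot be split into the requested number of levels.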