sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Predicting future time steps #17

Closed AnaR-Martins closed 4 years ago

AnaR-Martins commented 4 years ago

Hi,

First, I would like to thank you for the great package. I recently started working with bnstruct for a project of mine; however, I have a question I haven't been able to find the answer to. I have already learned a dynamic Bayesian network from my data, and now I would like to use it to predict future time steps. That is, I have new data at the initial time step, and I want to predict the next time steps according to the learnt DBN. Is it possible to do this? How should I do it?

Thank you in advance!

albertofranzin commented 4 years ago

Hi, you can perform inference by providing as observation the values of the variables in the first time step. Check Section 5.1 of the Vignette.
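
For reference, a minimal sketch of that route, assuming net is the DBN you already learned (the variable names and values below are placeholders to adapt to your own network):

engine <- InferenceEngine(net)
obs.vars <- c("V1", "V2", "V3")   # names of the variables in the first time step (placeholders)
obs.vals <- c(1, 2, 1)            # the values observed for them
add.observations(engine) <- list(obs.vars, obs.vals)
engine <- belief.propagation(engine)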

AnaR-Martins commented 4 years ago

I followed all the steps presented in Section 5.1 with the child dataset; however, I don't understand which part of the inference engine I should look at. For example, after performing belief propagation I get an engine with information like the junction tree, the number of nodes, the cliques, and so on, but I don't see where to get the values of the variables other than the ones I gave as observations. Or should I use the EM algorithm?

albertofranzin commented 4 years ago

Yes, the EM algorithm is the one to use. Honestly I never considered this particular case, but it's certainly interesting and I might (try to) do something more specific about it when I have more time.

What should work is:

1) learn the network

2) create an InferenceEngine:

engine <- InferenceEngine(network)

3) add a new observation to your dataset, with only the variables of the first time step observed, something like:

new.obs <- c(1,2,1,3, NA, NA, NA, NA, ...)
raw.data(dataset) <- rbind(raw.data(dataset), new.obs)

4) run the EM algorithm and get the updated dataset; the imputed values will be in the last row:

output <- em(engine, dataset)
imputed.data(output$BNDataset)
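
Putting the four steps together, here is a minimal self-contained sketch. It uses the child() dataset shipped with the package as a stand-in for your data and pretends, just for illustration, that the first four variables are the "first time step"; adapt the indices (and use learn.dynamic.network with num.time.steps for a DBN) to your own case.

library(bnstruct)

dataset <- child()
net     <- learn.network(dataset)         # for a DBN: learn.dynamic.network(dataset, num.time.steps = ...)
engine  <- InferenceEngine(net)

new.obs <- raw.data(dataset)[1, ]         # take an existing row as a template
new.obs[5:length(new.obs)] <- NA          # leave everything but the "first time step" unobserved
raw.data(dataset) <- rbind(raw.data(dataset), new.obs)

output <- em(engine, dataset)
tail(imputed.data(output$BNDataset), 1)   # the imputed values for the new row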

I think this is the best way to go; let me know if it works.

AnaR-Martins commented 4 years ago

Thank you very much for your answer!

I think it worked for discrete variables, but not for continuous ones. In fact, in this case, despite all the variables being continuous in the initial dataset, the predicted values are both continuous and discrete. I know that continuous variables are dealt with by quantizing them before learning; however, in the raw dataset they still appear continuous. So, to add continuous variables as observations, should I also give the continuous values?

Another question is whether it is possible to choose the thresholds for the different levels of the continuous variables when defining the dataset. I read about the "quantiles" slot, but I didn't understand its usage, nor how I should provide the quantiles. If we can't define the thresholds, is it possible to know to which level each value was assigned?

If you could enlighten me, I would be grateful. Thank you!

albertofranzin commented 4 years ago

About question 1:

despite all the variables being continuous in the initial dataset, the predicted values are both continuous and discrete.

Does it happen also with the last commit? If yes, is it possible for you to share a minimal working example and some data for me to reproduce the issue?

raw.data(dataset) contains the original data, while the discretization is done during the learning, and does not affect the original data.

About question 2: it's not possible to define the quantization thresholds from inside the package. What you have to do in that case is to discretize the variables of interest yourself, and treat them in bnstruct as discrete variables.
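
For example, a minimal sketch of doing that discretization yourself with base R's cut() (my.data and my.breaks are hypothetical; the resulting integer levels can then be declared with discreteness = 'd' and a matching node.sizes entry in BNDataset):

my.breaks <- c(-Inf, -0.5, 0, 0.5, Inf)                           # 4 intervals, chosen by you
my.data$X <- cut(my.data$X, breaks = my.breaks, labels = FALSE)   # integer levels 1..4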

The quantiles<- assignment method is there for internal use. But you can check the quantiles used when discretizing a variable after the learning: quantiles(network) will give you the list of thresholds for each continuous variable (a numerical vector per variable), including the extreme values. Note that the length of each vector may not match the desired one (e.g. what you specified in node.sizes), since it may not be possible to partition your observed values into the desired number of non-empty "buckets".
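
For instance, assuming net is the learned network:

thr <- quantiles(net)        # a list with one numeric vector of thresholds per variable
length(thr[[1]]) - 1         # number of non-empty intervals actually used for the first variable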

AnaR-Martins commented 4 years ago

Regarding the first question, I think it also happens with the last commit, yes. I will share with you the code I'm using, as well as the dataset.

training_JV.txt

The code I'm running is:

dataset_JV <- BNDataset(data = training_JV,
                        discreteness = rep('c', 84),
                        variables = c("X11", "X21", "X31", "X41", "X51", "X61", "X71", "X81", "X91", "X101", "X111", "X121",
                                      "X12", "X22", "X32", "X42", "X52", "X62", "X72", "X82", "X92", "X102", "X112", "X122",
                                      "X13", "X23", "X33", "X43", "X53", "X63", "X73", "X83", "X93", "X103", "X113", "X123",
                                      "X14", "X24", "X34", "X44", "X54", "X64", "X74", "X84", "X94", "X104", "X114", "X124",
                                      "X15", "X25", "X35", "X45", "X55", "X65", "X75", "X85", "X95", "X105", "X115", "X125",
                                      "X16", "X26", "X36", "X46", "X56", "X66", "X76", "X86", "X96", "X106", "X116", "X126",
                                      "X17", "X27", "X37", "X47", "X57", "X67", "X77", "X87", "X97", "X107", "X117", "X127"),
                        node.sizes = rep(4, 84),
                        num.time.steps = 7)

dbn_JV <- learn.dynamic.network(dataset_JV, num.time.steps = 7)
engine_JV <- InferenceEngine(dbn_JV)
new_obs <- c(1.200538, -0.632286, -0.202171, -0.403487, -0.074027, -0.322489, -0.157124, 0.266058, -0.125360, -0.199477, 0.029096, 0.102552, rep(NA, 72))
raw.data(dataset_JV) <- rbind(raw.data(dataset_JV), new_obs)
output_JV <- em(engine_JV, dataset_JV)
predicted_JV <- imputed.data(output_JV$BNDataset)

Am I making any mistake? In this case I didn't add 1 to the observations, as I did for another discrete dataset that started from 0. Is this right, i.e. adding 1 in the discrete case but not in the continuous one?

Regarding the second question, I was able to check the thresholds. Thank you!

albertofranzin commented 4 years ago

Thanks for the example and data, it was really helpful. I think I fixed the issue in the last commit. Let me know if it's working.

Your code is ok. The predicted values can then be accessed with tail(predicted_JV, 1) or any equivalent method.

AnaR-Martins commented 4 years ago

I unloaded the package with the command > detach("package:bnstruct", unload = TRUE), then installed the package again and finally loaded it again, but I still get the same result (with both continuous and discrete values). I am not sure whether by doing this I am actually using the latest commit. Is there any other command I should run to get the latest commit?

Sorry for the question, I am still beginning to learn how to work with git and Rstudio. Thank you!

albertofranzin commented 4 years ago

Then I assume you are using the CRAN version, right?

In this case, you can just clone the github repository using git clone git@github.com:sambofra/bnstruct.git from a terminal. It works for sure on Linux and OSX, I guess it works also under Windows but I am honestly not sure...

If you have already cloned the github repo locally, you can update it by going into its folder and running git pull.

Once you have the repository, you can install it with make install (note: this works only with the repo, the Makefile is not included in the versions submitted to CRAN).

If, for any reason, the above should not work or is not clear, I also include here the latest version of the package, which you can install like any other regular package. Feel free to ask me anything if there are things that are still unclear.

bnstruct_1.0.7.tar.gz
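
For example, the attached tarball can be installed from R (assuming it sits in the current working directory) with:

install.packages("bnstruct_1.0.7.tar.gz", repos = NULL, type = "source")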

AnaR-Martins commented 4 years ago

I was able to install the last commit and it worked! Thank you very much for your help and kindness!

albertofranzin commented 4 years ago

Perfect, thanks to you. As soon as I have the chance I'll send the new version to CRAN.

I'm closing this issue, feel free to reopen it or to open a new one if you encounter other problems.

AnaR-Martins commented 4 years ago

Hi, I was trying to apply the em algorithm to another engine and dataset; however, I got the following error:

> output_Origin1500 <- em(engine_Origin1500, dataset_Origin1500_TT)
bnstruct :: starting EM algorithm ...
... bnstruct :: learning network parameters ... 
... bnstruct :: parameter learning done.
Error in quantiles[[ovr]] : subscript out of bounds

Do you have any idea of what it means? I was able to create both the inference engine and the dataset without any problems.

Thank you

albertofranzin commented 4 years ago

Hello,

how do you create the InferenceEngine? Can you please paste here all the commands, and a description of the variables (e.g. the header file)?

AnaR-Martins commented 4 years ago

The commands I followed were:

Split into training data and test data

> index_Origin1500 <- createDataPartition(y=Origin_1500$V12, p=0.75, list=FALSE)
> training_strat_Origin1500<- Origin_1500[index_Origin1500,]
> testing_strat_Origin1500<- Origin_1500[-index_Origin1500,]

Create dbn

> dataset_Origin1500_TT <- BNDataset(data=training_strat_Origin1500,discreteness = rep('d',12),variables = c("X0__0" , "X1__0" , "X2__0" , "X3__0" , "X4__0" , "X5__0", "X0__1" , "X1__1" , "X2__1" , "X3__1" , "X4__1", "X5__1"), node.sizes = rep(2,12), num.time.steps=2,starts.from=0)
> dbn_Origin1500_TT_BDeu <- learn.dynamic.network(dataset_Origin1500_TT, num.time.steps=2,scoring.func="BDeu")

Predict future time steps

> engine_Origin1500 <- InferenceEngine(dbn_Origin1500_TT_BDeu)

> raw.data(dataset_Origin1500_TT) <- addObservations(testing_strat_Origin1500, raw.data(dataset_Origin1500_TT),6)

> output_Origin1500 <- em(engine_Origin1500, dataset_Origin1500_TT)
> predicted_Origin1500 <- imputed.data(output_Origin1500$BNDataset)

albertofranzin commented 4 years ago

My bad, I forgot to check one condition in the last fix. I updated the belief propagation method; can you check with the package I attach here? bnstruct_1.0.8.tar.gz

AnaR-Martins commented 4 years ago

It worked! Thank you

indu-bodala commented 2 years ago

Hello,

I am following the above lines with my dataset (16 nodes). I have the following issues.

I am trying to predict new samples with missing values using em. For this, I initially tried to add new observations to the existing data using the following line:

raw.data(dataset_orig) <- addObservations(new_obs, raw.data(dataset_orig),9)

where I am adding 9 new obs to the existing data (also note that each observation has 8 variables observed and 8 variables unobserved)

When I use addObservations, I get an error:

Error in addObservations(new_obs, raw.data(dataset_orig), 9) : 
  could not find function "addObservations"

I also tried using

raw.data(dataset_orig) <- rbind(raw.data(dataset_orig), new_obs)

for which I am getting another error

Error in checkSlotAssignment(object, name, value) : 
  assignment of an object of class “data.frame” is not valid for slot ‘raw.data’ in an object of class “BNDataset”; is(value, "matrix") is not TRUE

So eventually, I created a new BNDataset with the new observations added in. Then I ran the em algorithm with the inference engine previously created from dataset_orig:

output <- em(engine_orig, new_dataset)

Now, I get another issue:

Error in while ((difference > threshold && no.iterations <= max.em.iterations) ||  : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In runif(1, lb, ub) : NAs produced
2: In runif(1, lb, ub) : NAs produced
3: In runif(1, lb, ub) : NAs produced

I installed the CRAN version 1.0.11. Kindly suggest how I can rectify these issues.

Thank you very much!

albertofranzin commented 2 years ago

Hello,

the method is add.observations, not addObservations. You can check the manual with > ?'add.observations<-'. It is a setter method for an InferenceEngine, so it cannot be used on a dataset; it has to be used as:

net <- learn.network(dataset_orig) # or learn.dynamic.network, it is the same thing
ie <- InferenceEngine(net)
obs.vars <- c(...) # the 8 observed variables
obs.values <- c(...)  # the values taken by the variables you provided in obs.vars
add.observations(ie) <- list(obs.vars, obs.values)

Use the function to add one row; if you have nine rows to add, do it in a loop.
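
A rough sketch of such a loop, under my assumptions: new_obs is a hypothetical matrix with one test case per row and named columns, NA marks the unobserved variables, and each row is treated as an independent query with its own engine.

results <- vector("list", nrow(new_obs))
for (i in seq_len(nrow(new_obs))) {
  ie <- InferenceEngine(net)                   # net: the learned network
  observed <- !is.na(new_obs[i, ])
  obs.vars <- colnames(new_obs)[observed]      # names of the observed variables
  obs.vals <- new_obs[i, observed]             # their observed values
  add.observations(ie) <- list(obs.vars, obs.vals)
  results[[i]] <- belief.propagation(ie)
}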

The second error indicates that you are using the wrong data type. If you want to use rbind on the raw.data slot of the dataset, you have to provide the additional rows as an array, not a data.frame:

> a <- asia()
> class(raw.data(a))
[1] "matrix" "array"

The last error indicates that one of the conditions could not be verified, most likely because one of the learning steps failed and there was no difference variable. Without additional information I can only guess, but you are probably providing the dataset in some wrong format. Check how you create the new dataset; if you cannot solve the issue, let me know, along with the sequence of commands you followed.

indu-bodala commented 2 years ago

Hello,

I have some more information now:

So, I have a dataset of 9 observations, 16 variables and 2 timesteps. I learnt the dynamic network using the following steps.

train_2T <- read.csv("RC_imp_2T.csv")

discon <- c("C", "C", "C", "C", "C", "C", "D", "D", "C", "C", "C", "C", "C", "C", "C", "C")

ns <- c(5,5,5,5,5,5,4,4,5,5,5,5,5,5,5,5)

dataset_2T <- BNDataset(data = train_2T,
                     starts.from = 0,
                     num.variables = 16,
                     discreteness = discon,
                     node.sizes = ns,
                     variables = colnames(train_2T),
                     num.time.steps = 2,
                     na.string.symbol = 'NA')

layers_2T <- c(rep(1,8),rep(2,8),rep(3,8),rep(4,8))

RC_dbn_2T <- learn.dynamic.network(dataset_2T, 
                                   num.time.steps = 2,
                                   layering = layers_2T)

Until this point, everything worked.

Now, I build an inference engine and use it to predict values for another partially observed datapoint (test_2T) with the first timestep (all 16 variables) and the first 8 values of the second timestep observed using em. I used the steps below.

engine_2T <- InferenceEngine(RC_dbn_2T)

test_2T <- test_2T+1

raw.data(dataset_2T) <- data.matrix(rbind(raw.data(dataset_2T), test_2T))

output_engine_2T <- em(engine_2T, dataset_2T)

When running the last line, the R session crashes saying that 'R session aborted. Encountered a fatal error'.

Could you check whether I am following the correct steps and advise if I can do something to rectify the error in the last step? I also suspect my dataset may be too small (only 9 observations).

Thank you for your helpful advice.

albertofranzin commented 2 years ago

Hi,

can you try using add.observations and see if it works?

Also, print the network and see what it looks like. Nine observations is indeed very small and the results may not be what you expect.
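
For example (RC_dbn_2T being the network learned above; dag() is, if I remember correctly, the accessor for the adjacency matrix of the learned structure):

print(RC_dbn_2T)
dag(RC_dbn_2T)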

indu-bodala commented 2 years ago

I am a bit confused here. Sorry if I misunderstood, but you previously mentioned that add.observations is a setter method for an InferenceEngine. So it is not intuitive to me how I can predict missing values using the em method when I add an observation with missing data to the inference engine using add.observations.

Do you mean something like below?

engine_2T <- InferenceEngine(RC_dbn_2T)
obs.vars <- var_names_2T[1:24] # observed 16 from first timestep and 8 from second timestep 
obs.vals <- test_2T
add.observations(engine_2T) <- list(obs.vars, obs.vals)

engine_2T <-  em(engine_2T, dataset_2T)

I guess this won't work either.

Also, could you elaborate on what add.observations is doing here?

albertofranzin commented 2 years ago

Hi,

indeed, with add.observations you'll have to perform the belief propagation and sample the missing values manually, sorry for the confusion.
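
For completeness, a rough sketch of that manual route (placeholder names; get.most.probable.values() is, if I recall correctly, a convenient way to read off one value per variable after propagation, otherwise one can sample from the updated parameters by hand):

ie <- InferenceEngine(net)
add.observations(ie) <- list(obs.vars, obs.vals)   # the observed part of the test row
ie <- belief.propagation(ie)
get.most.probable.values(ie)                       # one value per variable, given the evidence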

Anyway, without knowing what data you have or what you are using as test observations I cannot reproduce the error so I can't tell you what the problem is. Is it something you can share (also privately)?

indu-bodala commented 2 years ago

Hi Alberto,

Sorry for the late response. I have used add.observations as follows:

obs.vars <- test_vars[1:24] # observed 16 vars from first timestep and 8 vars from second timestep 
obs.vals <- test_vals[1:24]
add.observations(engine_dbn) <- list(obs.vars, obs.vals)
engine_dbn <- belief.propagation(engine_dbn)

I am getting the following error:

Error in if (is.discrete[ovr]) { : missing value where TRUE/FALSE needed

It seems like this is related to the discreteness of the added observation. Is there a way to provide the discreteness of the added observations, or to make it the same as for the observations used for learning?

Btw, I can share the data privately. I have sent the dataset through email. I am learning the structure using the first 18 observations and trying to predict the missing values in the test observation.

Thank you for the help!

albertofranzin commented 2 years ago

Hello,

there was indeed a bug when handling the observations in the belief propagation, sorry. I fixed it and uploaded it on GitHub. I will test it a bit more before sending it to CRAN; let me know in case there are other issues.