sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

predict function #34

Open ningxuca opened 7 months ago

ningxuca commented 7 months ago

This is a question, not an issue per se. After structure learning and parameter learning, the bnlearn package provides a predict() function, which makes inferences on input data using the learned parameters. Is there similar functionality in bnstruct? Both my training data and my input data for prediction contain NA values, so bnstruct may be a better choice than bnlearn.

albertofranzin commented 7 months ago

Hello,

if you have NAs in your dataset you can learn a network by either imputing the missing values or using the SEM learning algorithm.

After learning, inference can be done by creating an InferenceEngine, providing some observations, then using the EM algorithm to predict the missing values. You can also use add.observations on the InferenceEngine, followed by belief.propagation.

Please check the documentation of the respective methods and Sections 5.1 and 6.1 of the package vignette, and let me know if you have any further questions.
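
A minimal sketch of the workflow described above, using the `child()` dataset bundled with bnstruct because it contains missing values. The choice of dataset, the default learning settings, and the observation (variable 1 in state 2) are illustrative assumptions, not recommendations; check the arguments and return values against the package documentation before relying on it.

```r
library(bnstruct)

# child() ships with bnstruct and contains missing values.
dataset <- child()

# Option A: impute the NAs first, then learn from the imputed data.
imputed <- impute(dataset)
net <- learn.network(imputed, use.imputed.data = TRUE)

# Option B: Structural EM learns directly from the incomplete data
# (slower, so it is left commented out here).
# net <- learn.network(dataset, algo = "sem")

# Inference route 1: EM on an InferenceEngine predicts the missing values.
engine  <- InferenceEngine(net)
results <- em(engine, dataset)
filled  <- results$BNDataset         # dataset with the NAs filled in
engine  <- results$InferenceEngine   # engine holding the re-estimated network

# Inference route 2: add evidence, then propagate beliefs.
# "Variable 1 observed in state 2" is an arbitrary illustrative choice.
engine2 <- InferenceEngine(net)
add.observations(engine2) <- list(c(1), c(2))
engine2 <- belief.propagation(engine2)
post    <- marginals(engine2)        # list of posterior marginals
```

Route 1 fills in every missing cell of the dataset at once, while route 2 answers a single query given specific evidence.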

ningxuca commented 7 months ago

Hi Alberto, thank you so much for your explanation. Some background information: my goal is to use some lab measurements to predict patients’ survival status at the 1-year mark. All variables are discrete. I have a training set and an input dataset. Both have NAs in the covariates, which are 5 lab tests (lbtest1 to lbtest5). The response variable ‘survival’ has no missing values.

  1. How do I specify that survival is the response (survival = either 1 (alive) or 2 (died)) and that the lab values (lbtest1, lbtest2, lbtest3, lbtest4, lbtest5) are explanatory variables when constructing the DAG from the training dataset? I can impute the NAs in lbtest1 to lbtest5 in the training dataset, and I can use either the raw dataset or the imputed dataset to obtain the DAG.
  2. Using the existing DAG and parameters from the training dataset, how do I calculate the probability of survival status from the same set of variables (lbtest1 to lbtest5) in the input dataset? Do I deal with NAs in the input dataset the same way as in the training dataset?
  3. My input dataset also has the real survival status. At the end I want to compute an AUC from the probabilities from step 2 and the real survival status, to see how good the prediction is. Is it possible to use bnstruct for steps 1 and 2?

Emily

albertofranzin commented 7 months ago

Hi Emily,

I'm not sure I fully understand the scenario. You have 6 variables, 5 of which are the lab tests, and the last one is the survival. So you have (lbtest1..5) -> survival, and you want to 1) learn the precise set of dependencies among the variables, and 2) learn the associated conditional probabilities. You have a training set (with observations for all 6 variables) and an additional set for which you want to predict the survival probabilities (to test the quality of the prediction). Both datasets contain NAs (except for the survival status of the training set). Is this correct?

If yes, then you can use bnstruct for both steps 1 and 2.

Step 1 is a modeling question, so I cannot suggest one particular solution because it depends on the medical application. Whatever solution you decide to implement, all you have to do is pass the proper prior knowledge to learn.network.

For example, you can use layering, and assign the lab variables to one layer and the survival status to a second layer. If you want to use, say, naive Bayes, you can manually define the structure (there is an example in the vignette). You can even make no assumptions at all and see whether learning the network without any prior knowledge gives you what you expect, that is, survival as an effect of the other variables.

In any case, with 6 variables you can use the exact SM (Silander-Myllymäki) algorithm instead of the MMHC heuristic. You can either impute the data in advance, or use the SEM algorithm.

For step 2, it's just an inference task where you assume that survival in the input set is all NAs. The missing values in the input set can either be treated the same way as in the training set, or can be predicted using an InferenceEngine. In the end, inference can be considered as an imputation of the unobserved survival status, so you can then compute the survival values, and/or get the marginal probabilities and compare them with the ones you have.

I don't include code snippets since they may depend on your choices, but you can find examples around, either in the documentation or here on GitHub among the issues.

Hope this helps.

Alberto
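
For reference, the two steps discussed in this thread could be sketched roughly as follows. Everything about the data is an assumption made for illustration: the synthetic generator, the 3 levels per lab test, the 10% missingness, and the hypothetical helper `prob.died`. The layering (lab tests in layer 1, survival in layer 2) is only one of the modeling options mentioned above, and the shape of the value returned by `marginals()` should be checked against the package documentation.

```r
library(bnstruct)
set.seed(1)

# Synthetic stand-in data (assumption): 5 discrete lab tests with 3 levels
# each, survival coded 1 = alive, 2 = died, and about 10% NAs in the lab tests.
make.data <- function(n) {
  labs <- matrix(sample(1:3, n * 5, replace = TRUE), ncol = 5)
  surv <- ifelse(rowSums(labs) + rnorm(n) > 10, 2, 1)
  labs[sample(length(labs), round(0.1 * length(labs)))] <- NA
  m <- cbind(labs, surv)
  colnames(m) <- c(paste0("lbtest", 1:5), "survival")
  m
}
train.raw <- make.data(300)
test.raw  <- make.data(100)

# Step 1: impute the training data, then learn structure and parameters with
# the exact SM algorithm; layering lets survival have lab tests as parents.
train <- BNDataset(data = as.data.frame(train.raw),
                   discreteness = rep("d", 6),
                   variables = colnames(train.raw),
                   node.sizes = c(rep(3, 5), 2))
train <- impute(train)
net <- learn.network(train, algo = "sm",
                     layering = c(1, 1, 1, 1, 1, 2),
                     use.imputed.data = TRUE)

# Step 2: for each patient, enter the observed lab tests as evidence,
# propagate beliefs, and read P(survival = 2). Lab tests that are NA are
# simply left unobserved; the true survival value is never given to the engine.
prob.died <- function(net, labs.row) {
  seen <- which(!is.na(labs.row))
  eng  <- InferenceEngine(net)
  if (length(seen) > 0)
    observations(eng) <- list(seen, unname(labs.row[seen]))
  eng <- belief.propagation(eng)
  as.numeric(marginals(eng)[[6]][2])   # variable 6 = survival, state 2 = died
}
p.died <- apply(test.raw[, 1:5], 1, function(r) prob.died(net, r))

# AUC via the rank (Mann-Whitney) formula, against the held-back true status.
died <- test.raw[, 6] == 2
r    <- rank(p.died)
auc  <- (sum(r[died]) - sum(died) * (sum(died) + 1) / 2) /
        (sum(died) * sum(!died))
```

Building one InferenceEngine per patient keeps the evidence of different patients separate, at the cost of rebuilding the junction tree each time; with 6 variables that cost is small.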