sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Error in cut.default(data[, i], quantiles, labels = FALSE, include.lowest = TRUE) : invalid number of intervals #8

Closed. jdanene closed this issue 5 years ago.

jdanene commented 5 years ago

I'm trying to learn a dynamic Bayesian network on microarray data. There are 5 time periods and 9 samples per time period; each time period contains 17 genes of interest.

Question 1: How does the bootstrap work with time series data? Is it a bootstrap within blocks of time, or a bootstrap over the whole timeline?

Main Question: I tried running learn.dynamic.network(ah_cte.BNDataset, num.time.steps = ah_cte.timePeriod, bootstrap = TRUE, scoring.func = "BIC")
overnight with 100 bootstrap samples, and I received the following error:
Error in cut.default(data[, i], quantiles, labels = FALSE, include.lowest = TRUE) : invalid number of intervals
The bootstrap samples were derived using the bootstrap function.

This might be caused by sampling with replacement producing too many duplicates; is there a way to switch to sampling without replacement?
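For context, the R-level error is easy to reproduce on its own. Whether this is exactly what happens inside bnstruct's discretization is an assumption, but a column that becomes constant after resampling collapses its quantile break points to a single value, and cut() rejects a single breaks value smaller than 2 as an invalid number of intervals:

x <- rep(1, 20)   # a resampled column that ended up constant
quantiles <- unique(quantile(x, probs = seq(0, 1, length.out = 4)))
quantiles         # all break points collapse to the single value 1
cut(x, quantiles, labels = FALSE, include.lowest = TRUE)
# Error in cut.default(x, quantiles, labels = FALSE, include.lowest = TRUE) :
#   invalid number of intervals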

albertofranzin commented 5 years ago

Question 1: the bootstrap is done over the whole timeline.

Question 2: no, since bootstrapping is based on resampling with replacement.
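For reference, a minimal sketch of the bootstrap workflow in bnstruct, following the package vignette (num.boots = 100 mirrors the setup reported above); with bootstrap the result is a weighted graph rather than a single DAG:

dataset <- bootstrap(dataset, num.boots = 100)   # resampling with replacement
net <- learn.dynamic.network(dataset, num.time.steps = 5, bootstrap = TRUE)
wpdag(net)   # weighted adjacency matrix: how often each edge appeared over the samples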

aazqueta commented 3 years ago

I am also having similar issues, and I am not applying bootstrap...

albertofranzin commented 3 years ago

Hello,

usually that issue was not due to bootstrap (though bootstrap can probably make it happen in some unlucky cases), but was caused by a bug that has since been fixed. Can you provide more information so I can try to reproduce the error? What error are you getting, what data do you have, which commands are you trying, etc.?

aazqueta commented 3 years ago

Hi Alberto,

thanks for your answer. I am training a dynamic Bayesian network on confidential patient data. I have 100 patients over 5 time periods and 7 variables:

names(df)
[1] "event"             "age"               "sex"               "diagnoses_I50"     "medications_02.02" "diagnoses_E11"    
 [7] "diagnoses_I10"     "event"             "age"               "sex"               "diagnoses_I50"     "medications_02.02"
[13] "diagnoses_E11"     "diagnoses_I10"     "event"             "age"               "sex"               "diagnoses_I50"    
[19] "medications_02.02" "diagnoses_E11"     "diagnoses_I10"     "event"             "age"               "sex"              
[25] "diagnoses_I50"     "medications_02.02" "diagnoses_E11"     "diagnoses_I10"     "event"             "age"              
[31] "sex"               "diagnoses_I50"     "medications_02.02" "diagnoses_E11"     "diagnoses_I10" 

I then convert it into BNDataset using the following commands:

BND_dataset_Dynamic <- BNDataset(data = df, 
                     discreteness = rep(F, ncol(df)),
                     variables = colnames(df),
                     layering = c(3,1,1,2,2,2,2),
                     num.time.steps = 5,
                     starts_from = 0,
                     node.sizes = rep(3, ncol(df))) 

dbn <- learn.dynamic.network(BND_dataset_Dynamic, num.time.steps = 5,
                             layering = c(3,1,1,2,2,2,2))

Since event is death (1 or 0), I placed it at the bottom layer, while age and sex are in the first layer. Everything else goes into the second layer.

The error is raised by the last line, when calling the learn.dynamic.network() function. I will try to come up with a reproducible example to be more specific. Thanks a lot, and congratulations on the R package bnstruct, I find it really useful!

aazqueta commented 3 years ago

also two general questions I have about the package:

1) does the discreteness entry of BNDataset() always have to be set to FALSE? I have discrete variables in my analysis (variables that take only the values 1 or 0), but whenever I set discreteness = TRUE, I get an error.

2) it is not clear to me what the node.sizes entry of BNDataset() does. I always set it to 3 for all variables because that is what most of the examples do, but its purpose is not entirely clear to me.

Thanks

albertofranzin commented 3 years ago

Hi, and thanks.

First, in the dataset part it should be starts.from = 0 instead of starts_from (a typo that will likely go unnoticed because of how R handles this kind of thing). I also just realized the starts.from parameter is not listed in the documentation for BNDataset even though it's actually there; I will add it as soon as I have the chance.

In the second line you have discreteness = rep(F, ncol(df)), which sets all the variables as continuous. You have to change it to discreteness = c(T, T, T, F, F, F, F), assuming event, age and sex are discrete variables and the other ones are continuous. Regarding age, however, if it is a whole range of values you may want to aggregate it into age groups (if it's not already modeled like that), or otherwise set it as continuous (a discrete variable with 100+ values would make the computation semieternal). Which error do you get when you try to set the variables as discrete? It may be due to either the typo mentioned above or the node.sizes issue below.

For discrete variables, the node.sizes parameter defines their cardinality, while for continuous variables it defines the number of levels into which they should be discretized (the package handles continuous variables in the data, not in the network). So, for event it will be 2 because you have only 0 and 1, while for a continuous variable it just depends on how much precision you want, or how much time you can afford for the learning. There is no real general rule for choosing the number of discretization levels; it really depends on your case. By default the discretization is done linearly, so if your data follows a different distribution (e.g. exponential) you have to either preprocess it manually, or provide the boundaries for the discretization intervals using the quantiles parameter.

In general, when you have (a dataset for) a dynamic BN, provide the metadata only for the first time step, and let the package methods replicate the information through the rest of the dataset. However, provide the information for all the variables in that time step; the package has some flexibility in the input but, for several reasons, it is not too flexible. So, I recommend using variables = colnames(df)[1:7] instead of the 35 variables.
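Putting these fixes together, a sketch of the corrected calls for this dataset (the node.sizes values here are illustrative assumptions: 2 for the binary event and sex, 3 age groups, and 3 discretization levels for each remaining continuous variable):

BND_dataset_Dynamic <- BNDataset(data = df,
                                 discreteness = c(T, T, T, F, F, F, F),
                                 variables = colnames(df)[1:7],
                                 layering = c(3, 1, 1, 2, 2, 2, 2),
                                 num.time.steps = 5,
                                 starts.from = 0,   # note the dot, not the underscore
                                 node.sizes = c(2, 3, 2, 3, 3, 3, 3))

dbn <- learn.dynamic.network(BND_dataset_Dynamic, num.time.steps = 5,
                             layering = c(3, 1, 1, 2, 2, 2, 2))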

I hope this helps. If the errors remain, and/or something is still not clear, feel free to let me know.

aazqueta commented 3 years ago

Hi Alberto,

thanks very much for your answer, it has been really helpful in understanding more the dynamics of the package. I followed your suggestions and it seems to work fine!

Age is definitely a complicated story: in my current setup it evolves with time (days), which might differ across patients. For example, a patient might have three time periods corresponding to when she was given a medication, e.g. at day 275, then at day 534, and the last one at day 1040. Another patient might have been given the same medication in different time periods, e.g. at days 10, 50, and 70. What I did is compress the times into three intervals and analyze the status of each medication or diagnosis (1 if it was present and 0 if not). One question I also have is whether the package allows such irregular time periods across observations, and also whether some observations (patients in our case) can have different numbers of time periods: one patient with 10 time periods vs. another one with only 2. In my current sample, I created a balanced dataset of the last 5 time periods of each patient and removed those that did not have at least 5 time observations.

Finally, the network produced looks a bit odd (for simplicity I removed age): [screenshot of the learned network]

albertofranzin commented 3 years ago

Hi Andres,

glad it helped!

The data has to be a matrix, so each observation has to have the same number of time steps, and that number is defined by the highest one (e.g. 10). If an observation has fewer time steps, you can fill the rest with NAs. This will, of course, make the learning task more complex and its results less reliable. If there are arcs you must have, you can impose them in the learning with mandatory.edges, though I don't think I ever tried it on a dynamic network, so I'm not sure it works across time steps (unless you do everything manually and treat the network as a simple BN).
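As a sketch of the padding, assuming the raw data is a list (here called patient.series, a hypothetical name) with one numeric vector per patient, each holding 7 values per observed time step:

n.vars    <- 7
max.steps <- 10   # length of the longest series
pad.obs   <- function(x) c(x, rep(NA, n.vars * max.steps - length(x)))
# one row per patient, NA-filled up to the longest series
df.padded <- do.call(rbind, lapply(patient.series, pad.obs))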

Regarding age, it is a modeling problem, so the package itself cannot be of much help. If I understand the question correctly, you can manually set the discretization levels with (e.g.) quantiles = c(0, 10, 30, 70, 200, 1000) to partition the values into the intervals [0,10], (10,30], (30,70], (70,200], (200,1000], assuming 0 and 1000 are respectively the lowest (or a lower bound) and the highest (or an upper bound) value you have, with a corresponding node.sizes entry of 5 (for the sake of the example).
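As a sketch of the age example, assuming (based on the package documentation) that quantiles takes a list with one vector of break points per variable, left NULL where the default linear discretization is fine:

qts <- vector("list", 7)                  # one entry per variable, NULL by default
qts[[2]] <- c(0, 10, 30, 70, 200, 1000)   # 6 boundaries -> 5 intervals for age
BND <- BNDataset(data = df,
                 discreteness = c(T, F, T, F, F, F, F),   # age left continuous here
                 variables = colnames(df)[1:7],
                 node.sizes = c(2, 5, 2, 3, 3, 3, 3),     # 5 levels for age
                 num.time.steps = 5,
                 quantiles = qts)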

That said, maybe it can be better to consider the relative difference in days since the last medication (10, 20, 40 instead of 10, 30, 70). This of course depends on the specific application, so make sure it applies to your case.

The resulting network indicates that there is not enough data to infer a conclusion on the conditional independence between many variables. In particular, the development of the disease (what I assume diagnoses is) seems to be independent of sex. The link between sex in different time steps makes sense (in a BN, see below) because (I assume) it doesn't change during the treatment; it is however strange that not all the sex variables are linked. Also, diagnosis and medication values are temporally correlated, which makes sense. event_t5 is the only event with a cause; I would guess that the development of the disease starts to have an impact on life/death after that. However, by removing the observations with fewer than 5 time steps you are probably altering the distribution of life/death observations by removing early deaths, and by trimming the longer ones to 5 you are "keeping everyone alive". I guess that the values for event in time steps 1 to 4 are almost entirely 0, making the variable effectively conditionally independent of the diagnosis, and more in general of everything else in the data.

Consider also that a BN has only directed edges, so an arc from sex_t1 to sex_t2 can be understood only from this perspective. Of course it is not a causal link, but the way the learning algorithm says "these two variables are clearly related so I will put an edge here, and you told me that there is only one possible direction".

To be honest, I don't really know how to solve this; it is a modeling issue and I am not an expert in survival analysis. You can try, at least as an experiment, to split all the observations so that every time step of a patient becomes a stand-alone observation, as in the sketch below. I expect the results to make more sense, and the strength of the relationships between variables to be reflected in the conditional probability tables. Depending on how you have modeled the variables, you may also try to use pairs of consecutive time steps. This will of course not solve the problem of modeling the entire evolution, but it can probably be more interpretable than the current network.
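A sketch of that splitting experiment, assuming df is the 35-column data with 7 variables per time step, ordered as shown earlier in the thread:

n.vars <- 7
steps  <- 5
df.split <- do.call(rbind, lapply(seq_len(steps), function(s) {
  block <- df[, ((s - 1) * n.vars + 1):(s * n.vars), drop = FALSE]
  colnames(block) <- colnames(df)[1:n.vars]   # unify the column names
  block
}))
# df.split has one row per patient per time step; learn it as a static BN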

aazqueta commented 3 years ago

Hi Alberto,

my apologies for the late reply. It has taken me some time to understand a bit more about my setup and whether or not it makes sense to use DBN for survival analysis. I still think it is a promising tool and I am eager to learn more.

In the meantime, I have started with the static version of the data; that is, I have created a matrix of patients (rows) with columns indicating the age at the start of the analysis, gender, and whether or not a given diagnosis, medication or procedure was present at any point in the patient history. One thing I have noticed is that whether the network is computed or not depends on how many patients I select. For example, this is how the network looks when I select the first 140 individuals: [screenshot] and this is how it looks when I select 145 individuals (keeping the layering and everything else intact):

[screenshot]

Could it be that when more individuals are selected, the causal links among the variables change?

Thanks,

Andres

albertofranzin commented 3 years ago

Hello Andres,

it seems a bit strange that including 5 additional observations makes the arcs disappear. Do you observe something similar with other numbers of rows (e.g. 100, 120, 130, more than 145 if you can)? What do the probability tables look like in the first case?

aazqueta commented 3 years ago

Hi Alberto,

Below 140 it produces a graph. I have tried different combinations, such as:

Only 30 rows: [screenshot]

Only 90 rows: [screenshot]

Only 120 rows: [screenshot]

Only 142 rows: [screenshot]

Only 143 rows: [screenshot]

How do I compute the probability tables?

Cheers

albertofranzin commented 3 years ago

If a network is generated, by default the probability tables (the parameters) are computed as well; you can access them with cpts(network). The only requirement is that all the edges are directed, so it will not work if you use bootstrap.
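For example, a minimal sketch (learn.network performs both structure and parameter learning by default):

net  <- learn.network(dataset)   # structure + parameters
tabs <- cpts(net)                # named list of conditional probability tables
tabs$event                       # table for a single variable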

I understand that every row is a different patient. Are they listed in some order? It looks like, with enough data, the package captures some meaningful relationships between the variables, but as the amount of data grows the significance of the connections diminishes (too many edges), until it disappears (no edges).

How you define the variables, and whether you're filtering the data beforehand (like what I mentioned previously, that event probably takes the same value too often), may have an impact too.

aazqueta commented 3 years ago

very interesting. When I compute the probabilities for 143 individuals I get the following:

$sex
sex
         1          2 
0.97619048 0.02380952 

$age
age
  1   2 
0.5 0.5 

$diagnoses_I10
diagnoses_I10
         1          2 
0.97619048 0.02380952 

$medications_01.02
medications_01.02
         1          2 
0.97619048 0.02380952 

$medications_02.02
medications_02.02
         1          2 
0.97619048 0.02380952 

$event
event
         1          2 
0.97619048 0.02380952 

When I do the same for 142 individuals, I get a much larger list:

$sex
sex
    1     2 
0.975 0.025 

$age
age
    1     2 
0.975 0.025 

$diagnoses_I10
, , medications_01.02 = 1, medications_02.02 = 1, diagnoses_I10 = 1

   age
sex         1   2
  1 0.9983607 0.5
  2 0.5000000 0.5

, , medications_01.02 = 2, medications_02.02 = 1, diagnoses_I10 = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , medications_01.02 = 1, medications_02.02 = 2, diagnoses_I10 = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , medications_01.02 = 2, medications_02.02 = 2, diagnoses_I10 = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , medications_01.02 = 1, medications_02.02 = 1, diagnoses_I10 = 2

   age
sex           1   2
  1 0.001639344 0.5
  2 0.500000000 0.5

, , medications_01.02 = 2, medications_02.02 = 1, diagnoses_I10 = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , medications_01.02 = 1, medications_02.02 = 2, diagnoses_I10 = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , medications_01.02 = 2, medications_02.02 = 2, diagnoses_I10 = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

$medications_01.02
, , medications_02.02 = 1, medications_01.02 = 1

   age
sex        1   2
  1 0.996732 0.5
  2 0.500000 0.5

, , medications_02.02 = 2, medications_01.02 = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , medications_02.02 = 1, medications_01.02 = 2

   age
sex           1   2
  1 0.003267974 0.5
  2 0.500000000 0.5

, , medications_02.02 = 2, medications_01.02 = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

$medications_02.02
, , medications_02.02 = 1

   age
sex         1   2
  1 0.9935065 0.5
  2 0.5000000 0.5

, , medications_02.02 = 2

   age
sex           1   2
  1 0.006493506 0.5
  2 0.500000000 0.5

$event
, , diagnoses_I10 = 1, medications_01.02 = 1, medications_02.02 = 1, event = 1

   age
sex        1   2
  1 0.999179 0.5
  2 0.500000 0.5

, , diagnoses_I10 = 2, medications_01.02 = 1, medications_02.02 = 1, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 2, medications_02.02 = 1, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 2, medications_01.02 = 2, medications_02.02 = 1, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 1, medications_02.02 = 2, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 2, medications_01.02 = 1, medications_02.02 = 2, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 2, medications_02.02 = 2, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 2, medications_01.02 = 2, medications_02.02 = 2, event = 1

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 1, medications_02.02 = 1, event = 2

   age
sex            1   2
  1 0.0008210181 0.5
  2 0.5000000000 0.5

, , diagnoses_I10 = 2, medications_01.02 = 1, medications_02.02 = 1, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 2, medications_02.02 = 1, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 2, medications_01.02 = 2, medications_02.02 = 1, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 1, medications_02.02 = 2, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 2, medications_01.02 = 1, medications_02.02 = 2, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 1, medications_01.02 = 2, medications_02.02 = 2, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

, , diagnoses_I10 = 2, medications_01.02 = 2, medications_02.02 = 2, event = 2

   age
sex   1   2
  1 0.5 0.5
  2 0.5 0.5

Does this make much sense?

albertofranzin commented 3 years ago

Sorry, I said it wrong in my previous comment. Anyway, yes, what you see makes sense.

A variable with no incoming edges will have only its probability (e.g. the probability values for 1 and 2).

However, when a variable has some parent nodes, its probability will depend also on the values of its parent nodes. What you see is not a different number of tables, but the fact that in the second case (142 observations) the "tables" have more dimensions (and the printout is ugly).

Anyway, when you look at the probabilities for 143 patients, you clearly see that, with the exception of age (which I don't know how you treat), all the other variables have one value that appears in almost all the observations. The other values probably appear in the middle of your dataset, and that's why you get some meaningful networks for 90 and 120 patients. After that, their contribution vanishes; in fact, for 142 patients, when you compare the probabilities for the values of event given a set of parents, you see mostly (0.5, 0.5).

In other words, your dataset is too homogeneous, and you need more observations with different values.

aazqueta commented 3 years ago

Hi Alberto, thanks very much for the explanation, it makes a lot of sense! Yes, I am treating age as a binary variable (above or below the median: 1 or 0), and the rest of the variables are also just 1 or 0. I guess having continuous data would give me the heterogeneity in the dataset that I need? Also, perhaps adding more variables. Regarding this last point, I read somewhere that with more than 30 variables the algorithm has a hard time finding the relationships?

Thanks again for the wonderful explanations!

albertofranzin commented 3 years ago

In bnstruct the continuous variables get discretized anyway, so you may just increase the number of possible values each variable can take. Or use another package :)

Anyway, one of the main concerns is event, which, if it is alive/dead, is hard to augment. For this you need other observations.

Regarding the number of variables: 30 or more nodes make the learning infeasible with exact algorithms (in bnstruct the one implemented is sm), but it's still fine for approximate algorithms. The quality of the learning depends anyway also on the data you have, so even with an exact algorithm you cannot be sure the resulting network matches the real generating process. Finally, a structure learning algorithm is unable to distinguish between networks in the same equivalence class, that is, networks representing the same set of conditional independencies. I know there is progress in the research on this, but I'm not really familiar with it, and it is anyway not implemented in bnstruct.
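For instance, the learning algorithm can be chosen with the algo parameter (a sketch; mmhc is the default approximate algorithm, sm the exact one):

net.approx <- learn.network(dataset, algo = "mmhc")  # scales to larger networks
net.exact  <- learn.network(dataset, algo = "sm")    # exact, small networks only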

Anyway, at least for the sake of curiosity, in the package vignette there are some experiments with running time and solution quality for the learning algorithms implemented.

aazqueta commented 3 years ago

thanks again for the explanation. I guess it makes more sense then to make "event" a continuous variable (days to death, for example). A further problem I see with Bayesian networks in survival analysis is that of censoring (when you stop having observations of patients and don't know if they died...).

Regarding structure learning being unable to distinguish between networks in the same equivalence class, what would be an example of this phenomenon? Cheers!

albertofranzin commented 3 years ago

The first part is about modeling the specific problem and therefore out of my area of expertise, so I cannot really give reliable indications.

Regarding the second part, the simplest case is a system with two nodes A and B and an arc between them. The two networks A -> B and B -> A belong to the same equivalence class, and are therefore equivalent for a structure learning algorithm.
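In probability terms, the two orientations factorize the same joint distribution, so no score computed from the data alone can separate them:

P(A, B) = P(A) P(B | A) = P(B) P(A | B)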