sambofra / bnstruct

R package for Bayesian Network Structure Learning
GNU General Public License v3.0

Invalid number of intervals error #9

Closed: spborder closed this issue 5 years ago

spborder commented 5 years ago

Hello,

Couple of quick questions, I'm trying to use a dbn for a set of 16 different feature variables at 5 different time points.

  1. I currently have it set up so the dataset has the same number of observations per time point, would this be able to handle if there were unequal numbers of observations for each stage? I have one time point with a lower number and if I can avoid taking random samples of the others that would be ideal.

  2. I keep on getting the "invalid number of intervals" error when I use dbn <- learn.dynamic.network(DBN_dataset, num.time.steps=4, layering= layers, scoring.func = "BIC") and I'm not sure if that is because my continuous variable quantization numbers are wrong in the header file or if I have the wrong number of time steps. I have the dataset set up so the variables are (V1_t1, V2_t1,...V1_t2, V2_t2,...etc).

[Attached image: traceback_output]

Thank you!

albertofranzin commented 5 years ago

Hi,

1) An observation has to be complete to be used. If I understand correctly, you have a situation like this:

V1_t1 V2_t1 V3_t1 ... V1_t2 V2_t2 V3_t2
  1     2     1   ...   1     2     2
  1     3     2   ...   1     2     2
  1     2     1   ...  NA    NA    NA

is that right?

If yes, then you need to either impute the missing values (using the built-in methods, using SEM for structure learning, or manually), or discard the observations with missing values (the entire row).
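As a rough illustration of the second option (discarding incomplete rows), here is a minimal base R sketch. The data frame and its column names are made up for the example, and the package's own imputation methods are not shown.

```r
# Toy wide-format data: one row per observation, columns ordered by time step.
# Column names and values are hypothetical.
obs <- data.frame(
  V1_t1 = c(1, 1, 1),
  V2_t1 = c(2, 3, 2),
  V1_t2 = c(1, 1, NA),
  V2_t2 = c(2, 2, NA)
)

# Keep only the rows that have a value in every column,
# i.e. observations that are complete across all time steps.
complete_obs <- obs[complete.cases(obs), , drop = FALSE]

nrow(complete_obs)
```

The third row is dropped as a whole, which matches the "entire row" requirement above.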

2) bnstruct follows the R convention for numbering (starts from 1 and not from 0), so if you have 5 time points you have to set num.time.steps = 5.

But I'm not sure that will solve the issue; most likely you're providing the wrong quantization numbers.
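A quick way to sanity-check the layout before learning is to confirm that the number of columns equals variables times time steps. The sketch below uses the 16 variables and 5 time points from the original post; the generated column names are hypothetical and only mirror the (V1_t1, V2_t1, ..., V1_t2, ...) ordering described there.

```r
num_vars <- 16        # feature variables per time point (from the thread)
num_time_steps <- 5   # time points, counted from 1

# Hypothetical column names in the described order:
# all variables at t1, then all variables at t2, and so on.
col_names <- paste0(
  "V", rep(seq_len(num_vars), times = num_time_steps),
  "_t", rep(seq_len(num_time_steps), each = num_vars)
)

# The dataset should have exactly this many columns.
expected_cols <- num_vars * num_time_steps
stopifnot(length(col_names) == expected_cols)

head(col_names, 3)
col_names[num_vars + 1]   # first column of the second time step
```

With that layout, the call from the original post would pass num.time.steps = 5 rather than 4.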

spborder commented 5 years ago

Alberto,

  1. Yes you're right there, no problem I can just use the number of samples that I have information for so they're all equal.

  2. It must be the quantization then. Do you have any tips for what the right number of intervals would be? I've tried a couple of different methods (number of unique values, maximum value, possible values by number of digits in maximum, etc.) and all of them give me the same "invalid number of intervals" error. I converted the feature values to integers already but some of the features are very large (O(10^7)) and some are smaller (O(10^2)).

Thank you

albertofranzin commented 5 years ago

Well, it's also context-dependent, that is, what those values mean.

You can try starting from low quantization values (2 or 3), see if it works, and then try increasing them. Probably not the advice you were hoping for, but if you're sure a certain number of quantization levels should work, and you know how to split the intervals, you can replace the values manually; it's tedious but easy to do.

The maximum value for sure won't work, since it will try to split the range into O(10^7) intervals, the vast majority of which would be empty.

"Possible number of digits in maximum" I guess is 7; it makes sense intuitively since your values scale exponentially, but maybe the data is very unbalanced and the internal R function cannot work out where to split. You can try replacing those values with their base-10 logarithm; it won't do more harm to your analysis than the quantization. You might also try with 4 or 5 if you don't have very low values.
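To make the logarithm suggestion concrete, here is a small base R sketch. The values are invented, and cut() with equal-width bins is used only to illustrate the idea of starting with few levels; it is not necessarily how the package quantizes internally.

```r
# Made-up positive values spanning several orders of magnitude.
x <- c(120, 450, 3200, 15000, 98000, 760000, 4100000, 12000000)

# log10 compresses the range; it requires strictly positive input,
# so zeros or negative values would need separate handling first.
stopifnot(all(x > 0))
log_x <- log10(x)

# Start with a small number of levels (3 equal-width bins here),
# then increase once this works.
levels_3 <- cut(log_x, breaks = 3, labels = FALSE)

table(levels_3)
```

On the raw scale, three equal-width bins would put almost every value in the first bin; on the log scale the values spread across all three.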

spborder commented 5 years ago

Ah still getting the same error. Could you explain a little more about what leads to the "invalid number of intervals" error? Could negative number features lead to that error? I guess I'm just not understanding why the intervals are invalid and not just leading to an inaccurate network.

spborder commented 5 years ago

Think I found the problem, for one of the features it was all zeros. When I deleted that one it ended up going through.
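A check like the following can find such columns up front. It is a minimal base R sketch with a made-up data frame; a column is flagged when it has fewer than two distinct non-missing values.

```r
# Hypothetical feature table; V2 is constant (all zeros).
features <- data.frame(
  V1 = c(3, 7, 1, 9),
  V2 = c(0, 0, 0, 0),
  V3 = c(10, 250, 4000, 12)
)

# TRUE for columns with fewer than two distinct non-missing values.
is_constant <- vapply(
  features,
  function(col) length(unique(col[!is.na(col)])) < 2,
  logical(1)
)

# Names of the columns that cannot be split into intervals.
names(features)[is_constant]

# Drop them before building the dataset.
features_ok <- features[, !is_constant, drop = FALSE]
```

For a dynamic network, the same variable presumably has to be dropped at every time step so the column layout stays consistent; that is an assumption, not something stated in the thread.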

Thanks for your help!

albertofranzin commented 5 years ago

Great.

So, just for the sake of completeness, what was happening was that the method was trying to split the list of zeros into a certain number of intervals (e.g. 5), but the boundary values for those intervals were all zeros. Since boundary values have to be unique (you cannot have e.g. an interval [4,4]), the cut function was terminating with an error.

A better explanation with examples can be found for instance here.
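The failure described above can be reproduced with base R's cut() alone. This is a sketch, not the package's code: quantile-based boundaries and the unique() step are assumptions chosen to show how an all-zero column produces each error message.

```r
zeros <- rep(0, 10)

# Boundaries for 5 intervals; for an all-zero vector they are all 0.
probs <- seq(0, 1, length.out = 6)
bounds <- quantile(zeros, probs = probs)

# Helper (hypothetical name): returns the error message instead of stopping,
# or NA if the expression succeeds.
error_of <- function(expr) {
  tryCatch({ expr; NA_character_ }, error = conditionMessage)
}

# Duplicate boundaries are rejected outright by cut().
dup_msg <- error_of(cut(zeros, breaks = bounds))

# If duplicates are collapsed first, a single value 0 is left, which cut()
# reads as "number of intervals = 0" and rejects as well.
one_msg <- error_of(cut(zeros, breaks = unique(bounds)))

dup_msg
one_msg
```

In an English locale the first message says the breaks are not unique and the second is the "invalid number of intervals" error from this issue.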