mauro-idsia / blip

Bayesian network Learning Improved Project
GNU Lesser General Public License v3.0
30 stars, 11 forks

Question about the dataset format #11

Closed: zhangjy019 closed this issue 5 years ago

zhangjy019 commented 5 years ago

Hi Mauro, I'm new to Bayesian network learning. As shown in the example dataset "child-5000.dat", the variable values of the datapoints are all integers. Can the code deal with float variable values in the dataset?

My other question is about the second line of the dataset, the "variables cardinalities". If the data values are floats, how do I get the cardinality of a variable? Will a variable have a cardinality in the thousands?

Thanks! zhangjy019

mauro-idsia commented 5 years ago

The score functions available in the Blip package all assume that the data is discrete (BIC, BDeu). You can first discretize the data, for example with Weka, and then use Blip.
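
As an illustration of what discretizing means here, the following is a minimal sketch of equal-width binning in plain Python. It is not Weka's filter and not part of Blip; the function name `discretize_equal_width` and the choice of three bins are assumptions made for the example.

```python
def discretize_equal_width(values, bins):
    """Map floats to integer bin indices 0..bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has a single state.
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


column = [0.1, 0.4, 2.5, 3.7, 9.9]
codes = discretize_equal_width(column, bins=3)
cardinality = 3  # one state per bin, so the cardinality equals `bins`
print(codes)
```

After this step every value is an integer in 0..bins-1, and the cardinality of the variable is the number of bins rather than the number of distinct floats.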

--

Skana "Just ask yourself the right question."

zhangjy019 commented 5 years ago

Thank you Mauro! About the second line in the dataset, the "variables cardinalities": what exactly does it mean? Does 5 mean that the values of the datapoints for that variable are among 0, 1, 2, 3 and 4? Or does it mean that the values can be any numbers, like 0, 10, 20, 21, 22, and we just need to make sure the number of unique values is 5?

Now I have processed my data into the format of child-5000.dat. I have around 8000 variables. The first line is the variable index from 0 to 7999. I tried both of the above interpretations of cardinality for the second line. When I ran "java -jar blip.jar scorer.is -d data/mydata.dat -j data/mydata.jkl -t 10 -b 0", no error occurred but the jkl file was empty. Do you know what happened? Thanks!

mauro-idsia commented 5 years ago

A cardinality of 5 means that the possible values are in {0..4} (as shown in "child-5000.dat").
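
If a column holds arbitrary labels such as 0, 10, 20, 21, 22, one way to bring it into the {0..k-1} form is to recode the distinct values. This is a sketch only; `encode_column` is a hypothetical helper, not something Blip provides.

```python
def encode_column(values):
    """Map arbitrary discrete values to codes 0..k-1; return (codes, k)."""
    # Sort the distinct values so the mapping is deterministic.
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    codes = [mapping[v] for v in values]
    return codes, len(mapping)


raw = [0, 10, 20, 21, 22, 10, 0]
codes, cardinality = encode_column(raw)
print(codes, cardinality)
```

The returned `cardinality` is the number that would go on the second line of the data file for this variable.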

If you can provide the data, I can take a look at what happened.

zhangjy019 commented 5 years ago

Thank you! After compression, the data has around 30MB. Can I send it to you at mauro@idsia.ch?

mauro-idsia commented 5 years ago

Upload it to a webservice like https://wetransfer.com/ and send the link.

zhangjy019 commented 5 years ago

https://we.tl/t-ZqziWY0raT

Thanks a lot for your help!

mauro-idsia commented 5 years ago

As explained in the README, the "-t" parameter sets the maximum amount of time available for the exploration. We recommend between 10 and 60 seconds per variable. In your run you allotted 10 seconds in total for 7732 variables. (I also recommend taking a look at the cardinalities: some are in the hundreds, while the first lines of the dataset show all the variables with values in 0-2.)
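
A quick way to compare declared cardinalities with the values actually present is sketched below. It assumes the layout described in this thread (first line variable names, second line cardinalities, then one row per datapoint) and assumes whitespace-separated fields; `cardinality_mismatches` is a hypothetical helper.

```python
def cardinality_mismatches(lines):
    """Return (name, declared, observed_max) for each variable whose
    largest observed value + 1 differs from its declared cardinality."""
    rows = (line.split() for line in lines if line.strip())
    names = next(rows)
    declared = [int(c) for c in next(rows)]
    observed_max = [-1] * len(names)
    # Stream the data rows so a large file never has to fit in memory.
    for row in rows:
        for i, token in enumerate(row):
            observed_max[i] = max(observed_max[i], int(token))
    return [
        (name, card, top)
        for name, card, top in zip(names, declared, observed_max)
        if top + 1 != card
    ]


sample = ["0 1 2", "3 2 200", "0 1 2", "2 0 1", "1 1 0"]
report = cardinality_mismatches(sample)
print(report)
```

Because the function only iterates over lines, an open file object can be passed in place of `sample`. A mismatch is not necessarily an error, since a legitimate state may simply never occur in sparse data, but it is worth a look.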

zhangjy019 commented 5 years ago

I changed the time limit to a larger value, 60 seconds or even 1000, but the output JKL file is still empty. I've checked the cardinalities and I think they are correct. The values are very sparse, which is why you only see 0-2 in the first few lines.

If I sample just the first 1000 lines, I can get the JKL file successfully.

mauro-idsia commented 5 years ago

As a rule of thumb, we recommend 60 seconds per variable. In your case, assuming the machine has 10 cores, you should allow for about 12.8 hours of computation.

The data file you want to analyze is large, more than 6 GB. Even just opening it in a text editor takes a considerable amount of time.

zhangjy019 commented 5 years ago

Thank you for all the information. The data is really large but sparse. The problem is that the code runs for several minutes and then finishes with no error messages. It looks as if it completed successfully, but when you check the output JKL file, it is empty.

You can try "java -jar blip.jar scorer.is -d data/all_data.dat -j data/all_data.jkl -t 60 -b 0" and then leave it running. I'm using a server with 56 cores and it takes around 11 minutes to finish.

mauro-idsia commented 5 years ago

"As a rule of thumb, we recommend 60 seconds per variable".

Execute it with "-t 8200" (60 seconds for each variable, divided by 56 cores).
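
The arithmetic behind these figures can be sketched as follows. It assumes, as the replies in this thread suggest, that "-t" is a total budget in seconds and that the work divides evenly across cores; `time_budget_seconds` is a hypothetical helper, and 7732 is the variable count mentioned earlier in the thread.

```python
def time_budget_seconds(n_variables, cores, seconds_per_variable=60):
    """Total -t budget: per-variable time, spread evenly over the cores."""
    return n_variables * seconds_per_variable / cores


ten_cores = time_budget_seconds(7732, cores=10)
fifty_six_cores = time_budget_seconds(7732, cores=56)

print(f"10 cores: {ten_cores:.0f} s ({ten_cores / 3600:.1f} h)")
print(f"56 cores: {fifty_six_cores:.0f} s")
```

With 10 cores this gives 46392 seconds, a little under 13 hours and in line with the roughly 12.8 hours quoted earlier; with 56 cores it gives about 8284 seconds, which "-t 8200" approximates.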

zhangjy019 commented 5 years ago

Thanks anyway. Using -t 8200 does not change anything; the JKL file is always empty.

mauro-idsia commented 5 years ago

99% of the values in your dataset are 0. Any machine learning approach would fail to find meaningful correlations between the variables.
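
The share of zeros in a data file can be measured with a short script like the one below. It is a sketch under the same assumptions about the layout as elsewhere in this thread (two header lines, then whitespace-separated integer rows); `zero_fraction` is a hypothetical helper.

```python
def zero_fraction(lines):
    """Fraction of data cells equal to 0, skipping the two header lines."""
    rows = (line.split() for line in lines if line.strip())
    next(rows)  # variable names
    next(rows)  # cardinalities
    zeros = total = 0
    for row in rows:
        zeros += row.count("0")
        total += len(row)
    return zeros / total if total else 0.0


sample = ["0 1 2", "2 2 3", "0 0 0", "0 1 0", "0 0 2", "0 0 0"]
fraction = zero_fraction(sample)
print(f"{fraction:.1%} of the values are 0")
```

Passing an open file object instead of `sample` streams the file line by line, which matters for a multi-gigabyte dataset.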
