probcomp / BayesDB

A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself. New implementation in http://github.com/probcomp/bayeslite
http://probcomp.csail.mit.edu/software/bayesdb/
Apache License 2.0

How to infer for NAN values for multinomials? #14

Closed jostheim closed 10 years ago

jostheim commented 10 years ago

I am reading a file with some binary classifications; some of the classifications are set to None (or NAN, or nan, or whatever; I've tried them all). BayesDB works fine reading in the file and properly sets the column type to multinomial.

I then go ahead and init and analyze:

client('INITIALIZE 20 MODELS FOR tourney_table;')
client('ANALYZE tourney_table FOR 100 ITERATIONS;')

Then I infer:

tmp = client("INFER winner FROM tourney_table WITH CONFIDENCE 0.01 ;", pretty=False)

and I print out tmp:

[{'data': [(0, '1.0'), (1, '1.0'), (2, '1.0'), (3, nan), (4, nan), (5, nan), (6, nan), (7, '1.0'), (8, '1.0'), (9, '1.0'), (10, '1.0'), (11, nan), (12, '1.0'), (13, nan), (14, '1.0'), (15, '1.0'), (16, '1.0'), (17, nan), (18, '1.0'), (19, '1.0'), (20, '1.0'), (21, '1.0'), (22, nan), (23, '1.0'), (24, nan), (25, '1.0'), (26, '1.0'), (27, nan), (28, nan), (29, '1.0'), (30, nan), (31, nan), (32, '1.0'), (33, nan), (34, '1.0'), (35, nan), (36, nan), (37, nan), (38, '1.0'), (39, '1.0'), (40, '1.0'), (41, nan), (42, nan), (43, nan), (44, '1.0'), (45, nan), (46, nan), (47, nan), (48, '1.0'), (49, nan)
...
(572, '1.0'), (573, '1.0'), (574, '1.0'), (575, '1.0'), (576, '1.0'), (577, '-1.0'), (578, '1.0'), (579, '1.0'), (580, '-1.0'), (581, '1.0'), (582, '1.0'), (583, '1.0'), (584, '1.0'), (585, '-1.0'), (586, '1.0'), (587, '-1.0'), (588, '1.0'), (589, '-1.0'), (590, '-1.0'), (591, '-1.0'), (592, '-1.0'), (593, '-1.0'), (594, '1.0'), (595, '1.0'), (596, '-1.0'), (597, '1.0'), (598, '-1.0'), (599, '1.0'), (600, '-1.0'), (601, '1.0'), (602, '1.0'), (603, '1.0'), (604, '-1.0'), (605, '-1.0'), (606, '-1.0'), (607, '-1.0'), (608, '1.0'), (609, '1.0'), (610, '-1.0'), (611, '-1.0'), (612, '1.0'), (613, '-1.0'), (614, '1.0'), (615, '-1.0'), (616, '1.0'), (617, '1.0'), (618, '1.0'), (619, '1.0'), (620, '1.0'), (621, '1.0'), (622, '1.0'), (623, '1.0'), (624, '-1.0'), (625, '-1.0'), (626, '1.0'), (627, '1.0'), (628, '1.0'), (629, '1.0'), (630, '1.0'), (631, '1.0'), (632, '1.0'), (633, '1.0'), (634, '1.0'), (635, '1.0'), (636, '1.0'), (637, '1.0'), (638, '1.0'), (639, '1.0'), (640, '1.0'), (641, '1.0'), (642, '-1.0'), (643, '1.0'), (644, '1.0'), (645, '1.0'), (646, '-1.0'), (647, '1.0'), (648, '1.0')] 

Obviously I am looking for a prediction of something other than "nan" for those initial values (something in [1.0, -1.0]).

I have no doubt that the algorithm is returning 'nan' because it assumed the set of values for that column is [1.0, -1.0, 'nan'] instead of [1.0, -1.0], but I have not been able to figure out how to make it treat 'nan' (or 'None') not as a valid column value, but as a missing one to infer.
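For reference, one way to normalize those missing-value strings before handing the CSV to BayesDB. This is a sketch assuming pandas is available; it is not BayesDB API, and the column names here are invented for illustration:

```python
# Sketch only, assuming pandas: map the literal strings 'NAN'/'None'/'nan'
# to real missing values so the multinomial's value set stays [1.0, -1.0].
import io
import pandas as pd

csv = io.StringIO("index,winner\n0,1.0\n1,NAN\n2,-1.0\n3,None\n")
df = pd.read_csv(csv, na_values=["NAN", "None", "nan"])

print(sorted(df["winner"].dropna().astype(float).tolist()))  # [-1.0, 1.0]
print(int(df["winner"].isna().sum()))                        # 2
```

Writing this frame back out with `df.to_csv(...)` would leave the missing cells empty rather than spelled out as 'NAN'.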

jbaxter commented 10 years ago

Hi @jostheim, to help me figure out what's going on, could you tell me what branch and commit you're on? Thanks!

jostheim commented 10 years ago

Of course! Sorry I didn't include it originally.

commit ebc10f0a901b8f207934acf5f196f49dec16ad93
Author: Jay Baxter <jbaxter@mit.edu>
Date:   Fri Feb 21 20:40:35 2014 -0500

    DROP MODELS working

Oh and let me put in a snippet of the file I am using:

index,TEAM2_pyth,TEAM1_oppd_rnk,TEAM1_kaggle_id,TEAM1_opp_pyth_rnk,TEAM1_w,TEAM2_adjt_rnk,TEAM2_conf,TEAM2_adjd_rnk,TEAM2_kaggle_id,TEAM2_w_per,TEAM2_seed,TEAM2_w,TEAM2_kenpom,TEAM1_adjo_rnk,TEAM2_oppd_rnk,TEAM2_adjo_rnk,ROUND,TEAM1_l,TEAM1_oppo_rnk,TEAM2_opp_pyth_rnk,TEAM1_ncopp_pyth_rnk,TEAM2_adjo,TEAM2_ncopp_pyth,WINNER,TEAM1_seed,TEAM1_oppd,SEED1,SEED2,SCORE1,SCORE2,TEAM2_rpi,TEAM1_ncopp_pyth,TEAM1_oppo,TEAM2_ncopp_pyth_rnk,TEAM1_team,TEAM2_adjt,TEAM1_adjo,TEAM2_oppo_rnk,TEAM1_w_per,TEAM1_adjd,TEAM1_rpi,TEAM2_adjd,TEAM2_oppo,TEAM2_luck,TEAM1_conf,TEAM1_adjt,TEAM2_oppd,TEAM1_opp_pyth,TEAM2_opp_pyth,TEAM1_year,TEAM1_pyth,TEAM2_team,TEAM2_luck_rnk,TEAM1_luck,TEAM1_tour,TEAM1_luck_rnk,TEAM1,TEAM2,TEAM1_kenpom,TEAM2_l,TEAM2_tour,TEAM2_year,TEAM1_adjt_rnk,TEAM1_adjd_rnk
0,0.2817,200,693,310,20.0,185,8,321,645,0.4166666666666667,25,15.0,256,313,303,154,4,17.0,327,306,198.0,101.5,0.4993,NAN,19,101.7,16,16,NAN,NAN,290,0.4762,96.3,165.0,67,65.7,91.5,305,0.5405405405405406,96.2,205,110.1,98.2,-0.035,14,66.8,103.6,0.3484,0.3509,2013,0.3593,76,271,0.037000000000000005,0,90,67,76,220,21.0,0,2013,129,77
1,0.3593,4,651,9,35.0,129,17,77,693,0.5405405405405406,25,20.0,220,4,200,313,2,5.0,16,310,114.0,91.5,0.4762,NAN,14,96.3,1,16,NAN,NAN,205,0.5415,104.9,198.0,42,66.8,117.4,327,0.875,86.4,2,96.2,96.3,0.037000000000000005,4,66.8,101.7,0.7278,0.3484,2013,0.9713,112,90,-0.016,0,229,42,112,1,17.0,0,2013,126,3
2,0.8582,28,557,34,26.0,99,27,73,676,0.6764705882352942,30,23.0,31,7,64,18,2,9.0,59,76,174.0,112.2,0.4578,NAN,26,97.4,8,9,NAN,NAN,42,0.4952,103.7,224.0,15,67.6,116.4,92,0.7428571428571429,99.3,20,95.9,102.9,-0.032,16,64.9,98.7,0.6709,0.6181,2013,0.8615,96,262,0.02,0,132,15,96,30,11.0,0,2013,218,135
...
644,0.4073,46,779,45,30.0,108,8,115,645,0.53125,18,17.0,185,11,251,253,2,2.0,50,243,8.0,95.0,0.7499,1.0,0,96.8,1,16,82.0,63.0,161,0.8114,106.2,27.0,82,69.1,118.0,236,0.9375,87.6,3,98.2,98.8,-0.011000000000000001,0,68.1,103.9,0.7428,0.3611,2004,0.9684,76,174,0.055999999999999994,0,36,82,76,6,15.0,0,2004,149,10
645,0.5148,114,607,107,28.0,48,16,109,826,0.58064516129032,17,18.0,146,8,235,202,2,3.0,95,176,26.0,98.3,0.6619,1.0,1,99.1,2,15,76.0,49.0,139,0.7519,104.0,75.0,31,70.8,119.3,130,0.90322580645161,92.4,17,97.8,102.8,0.057999999999999996,27,68.3,103.2,0.6351,0.4868,2004,0.9497,174,33,0.008,0,128,31,174,15,13.0,0,2004,136,40
646,0.9354,6,671,14,18.0,62,31,15,699,0.73529411764706,10,25.0,22,12,81,39,2,12.0,33,54,11.0,111.8,0.7829,-1.0,6,94.2,7,10,66.0,72.0,27,0.7947,106.9,14.0,49,70.4,118.0,42,0.6,96.9,38,88.6,106.5,-0.046,2,65.5,97.8,0.8102,0.7273,2004,0.9062,105,249,-0.024,0,199,49,105,32,9.0,0,2004,243,97
647,0.4607,19,559,24,33.0,228,2,155,828,0.7,17,21.0,168,4,230,187,2,6.0,26,267,73.0,99.0,0.5529999999999999,1.0,1,95.6,2,15,70.0,53.0,115,0.664,107.2,170.0,16,66.0,119.9,287,0.8461538461538499,85.5,2,100.4,95.6,0.1,4,69.7,103.1,0.7899,0.294,2004,0.9799,175,6,0.006,0,135,16,175,2,9.0,0,2004,86,5
648,0.2706,285,593,298,15.0,264,24,197,644,0.64516129032258,18,20.0,232,240,288,250,4,17.0,298,311,98.0,95.2,0.2881,1.0,13,104.9,16,16,72.0,57.0,202,0.6265,94.9,304.0,25,65.1,96.0,312,0.46875,107.6,253,103.7,93.2,0.047,14,72.5,105.0,0.2408,0.2025,2004,0.2115,75,47,-0.004,0,159,25,75,255,11.0,0,2004,20,262

I can upload some more of the file if need be, and I am pretty good with Python if I can get pointed in the right direction. I am analyzing some NCAA tournament data that I analyzed last year with RandomForests... but really I just wanted to try it out.

jbaxter commented 10 years ago

Thanks for being so prompt and descriptive -- I was able to replicate this and push a fix. It turned out to be a bug in INFER where we were using "if var" where we should have been using "if var is not None".
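For anyone curious, here is a small self-contained illustration of that pitfall (not the actual BayesDB code):

```python
# Truthiness conflates None with other falsy values such as 0.0, '' and
# False, whereas an explicit identity check drops only None.
values = [1.0, 0.0, -1.0, None, '', False]

kept_truthy = [v for v in values if v]                # also drops 0.0, '', False
kept_not_none = [v for v in values if v is not None]  # drops only None

print(kept_truthy)    # [1.0, -1.0]
print(kept_not_none)  # [1.0, 0.0, -1.0, '', False]
```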

I'm very interested in hearing how this analysis works out! Please comment when you are done so I can hear your results; we are always interested in more case studies of BayesDB used on real-world datasets!

I would also recommend using 100 models and 250 iterations in order to get high-quality results, although I know that can start to get pretty compute-heavy.

jostheim commented 10 years ago

Oh goodness, I can't tell you how many times that has bitten me, especially with boolean variables, which Python evaluates as integers. Thanks for the fix; I'll pull it and try to run it tonight.

Thanks for the parameter suggestions as well, I want to get it tested out then swap it over to my 64GB RAM workstation to chomp on.

I'll post back when I have some results, I'll probably do a blog post too!

vkmvkmvkmvkm commented 10 years ago

Hi @jostheim,

Thanks for helping to put our alpha through its paces :) and for sending us such clear, helpful information.

We're currently working on an inference quality test suite internally, which is sure to flush out many bugs in the inference engine. Please do let us know if something you find might be a good candidate addition, or seems like an anomaly we should look into more closely.

More generally, are you interested in talking with us a bit to help us determine how to evolve the project, or in talking to us a bit more about how to address problems in sports analytics? We're new to the domain, but think it's a great proxy for many problems of more general interest.

Vikash


jostheim commented 10 years ago

No problem, I am an old (well that is relative) Bayesian Network guy, I've built my own SPI based BN learning engine (in Java :( ). I did MCMC for my thesis in grad school (in 2002, fitting radial profiles of dwarf galaxies), so I am very fond of these types of approaches. Honestly I've run into a point with Bayesian Networks where I can't find good ways of scaling them, and it seems that this kind of technique (the latent variable "stuff" in crosscat) is the next step in the paradigm. So I am quite interested in these topics both from a practical data science point of view and a theoretical point of view.

That was a long way of saying, yes I am happy to help/talk/report-bugs. In terms of sports analytics, I am definitely not a professional at it, I just started dabbling with NCAA data last year, and I have some NFL analyses that I've done with RandomForests I want to try out too.

I think that one thing I'll definitely do once I get this running is find a way to plot out the column dependencies in a BN sort of format (but not a DAG obviously). The variable dependencies and strengths (or more likely groups of variables, like markov blankets), presented in a meaningful way is something that other ML techniques don't do very well and would be quite interesting.
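A rough sketch of the kind of plot I have in mind. The column names, dependence values, and the use of networkx are all invented for illustration; nothing here is BayesDB output:

```python
# Hypothetical sketch: pairwise dependence strengths rendered as a
# weighted undirected graph (not a DAG), one edge per dependent pair.
import networkx as nx

dep = {
    ("TEAM1_seed", "WINNER"): 0.8,
    ("TEAM2_seed", "WINNER"): 0.7,
    ("TEAM1_adjo", "TEAM1_adjd"): 0.3,
}
G = nx.Graph()
for (a, b), w in dep.items():
    G.add_edge(a, b, weight=w)

print(G.number_of_edges())   # 3
print(G.degree("WINNER"))    # 2
```

From there, `nx.draw` with edge widths proportional to the weights would give the "strength of dependence" picture.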

I'll look for your test suite and try to add tests if I catch anything (and of course send you a pull request).

jostheim commented 10 years ago

After running this data through the newest release, I get an accuracy of about 60% on predicting the multinomial. Scikit-learn's RandomForest gets an accuracy of around 75%. I realize that this project is not shooting for classification accuracy as a meaningful metric, but after 14 hours of running I'd expect a bit better. I am probably doing something wrong.

Here are my initialize and analyze steps:

client('INITIALIZE 300 MODELS FOR tourney_table;')
client('ANALYZE tourney_table FOR 600 ITERATIONS;')

I did a fairly big run to try and make sure I could get good accuracy (as in the comment by @jbaxter above), it took about 14 hours to run on 8 cores.

Then to compute accuracy (a bit of pandas code leaked in):

tmp = client("INFER winner FROM tourney_table WITH CONFIDENCE 0.5;", pretty=False)
correct_count = 0
total_count = 0
for (index, val) in tmp[0]['data']:
    # Only score rows whose WINNER was missing in the training data
    if str(train_features.ix[index]["WINNER"]) == "nan":
        total_count += 1
        if float(val) == test_features.ix[index]["WINNER"]:
            correct_count += 1
    print val, test_features.ix[index]["WINNER"], float(val) == float(test_features.ix[index]["WINNER"]), train_features.ix[index]["WINNER"]
print correct_count, total_count, float(correct_count)/float(total_count)

161 265 0.607547169811

Has anyone else done comparisons on simple classification or regression against other ML techniques? How would you expect this to perform? Am I doing something silly that is suboptimal? (Or is my code wrong? I wrote it very quickly.)

Thanks in advance!

jbaxter commented 10 years ago

Hi @jostheim, thank you for taking the time to run this experiment! After running for 14 hours on 8 cores, we’d want better too! Luckily, one item on our development roadmap is likely to increase performance by at least 10x.

There is a CrossCat paper that’s currently accepted and under review that will contain comparisons to a few standard baselines - things like random forests, SVMs, etc. - with mixed results.

As you mention, BayesDB is designed to estimate the joint probability density, not classification accuracy. Future versions of BayesDB are likely to let the user perform classification, using extended versions of the CrossCat engine, random forests, and other similar models if they would like.

It’s also very important to consider that BayesDB doesn’t have support for setting a decision boundary based on a loss function yet. When you say “INFER… WITH CONFIDENCE 0.5”, 0.5 actually isn’t a decision boundary: it simply tells BayesDB to fill in the most probable value if we are at least 50% sure of it (which, in a binary classification setting, we always will be by definition). In the future, we imagine a BQL command that would allow the user to specify columns of interest for classification, which would allow CrossCat to adjust its likelihood function in order to optimize for classification on those columns.
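A toy numeric check of that binary-classification point (the probabilities here are invented):

```python
# With only two possible values, the more probable one always has
# probability >= 0.5, so CONFIDENCE 0.5 can never withhold a prediction
# in the binary case.
for p_win in [0.01, 0.3, 0.5, 0.9]:
    most_probable = max(p_win, 1.0 - p_win)
    assert most_probable >= 0.5
print("every binary cell clears CONFIDENCE 0.5")
```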

In addition to those considerations, we would appreciate it if you were willing to show us your entire experimental setup, including a cross-validation harness or anything like that you used. Depending on how you ran the RandomForest, these accuracy numbers may be a symptom of overfitting, but it is hard to be certain without seeing the experimental setup.
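To make the cross-validation point concrete, here is a hedged sketch of a held-out RandomForest baseline on synthetic data (the modern scikit-learn API is assumed here; the dimensions are made up to roughly match this dataset):

```python
# Cross-validated RandomForest baseline: scoring on held-out folds avoids
# the inflated accuracy you get from scoring on rows the model trained on.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=649, n_features=60, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 3))  # mean held-out accuracy across 5 folds
```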

After checking the experimental setups, we would additionally want to check the datatypes (use SHOW SCHEMA and UPDATE DATATYPES) to ensure that all data has the proper type. Since BayesDB is currently in alpha development, there is also the possibility of a bug in the inference engine, which we would need to debug by looking at the logged diagnostic information.

jbaxter commented 10 years ago

Hi @jostheim, we just caught a bug that has a huge effect on INFER accuracy! Could I trouble you to run your INFER query again and re-score the results? You don't need to re-run INITIALIZE MODELS or ANALYZE, since the bug was just in INFER itself.

jostheim commented 10 years ago

Not a problem, I was going to respond anyway with code and data, I just got caught up with my day job and family :)

I'll rerun as soon as I can and report back!


jostheim commented 10 years ago

Can I get an email address I can send a link to pick up code and data (I don't want to post it all publicly)? Is an iPython notebook okay?

jbaxter commented 10 years ago

Yes, you can send it to bayesdb@mit.edu (which will go to our team of a few people), and an iPython notebook is definitely ok. Thanks!!

jostheim commented 10 years ago

I ran with the newest code and the accuracy jumped up to around 69%! That's a jump of about 9 percentage points, so that was a good fix! I'm cleaning up the code to send to you guys...

Do you guys have any test cases running through some exactly correlated columns, and others through some exactly random columns? I have some such tests using mutual information and conditional mutual information functions that give me some confidence things are working (at the extremes, at least).

jbaxter commented 10 years ago

Thanks, that's good to hear :)

And yeah, we've run a number of tests with exactly correlated vs. exactly random columns, and have found that BayesDB does quite well in recovering which columns are correlated and which aren't.

jostheim commented 10 years ago

Sent on my code (finally).

I was asking about the constraint testing more to verify the accuracies rather than the correlations, though I phrased that poorly since mutual information is usually for correlation. I was thinking you could set up 10 columns, all correlated (and uncorrelated), run your algorithm, then run RF and sanity-test your performance. Although that is not really a sanity test so much as a performance comparison :)
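A sketch of the extreme-case check described above, using sklearn's mutual information on discrete columns (an assumption about tooling on my end, not anything from BayesDB's own test suite):

```python
# An exactly correlated column pair should show mutual information equal
# to the column's entropy; an independent pair should be near zero.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
a = rng.randint(0, 2, size=10000)
correlated = a.copy()                        # exact copy of a
independent = rng.randint(0, 2, size=10000)  # unrelated column

print(mutual_info_score(a, correlated))   # ~0.693 nats (ln 2, the entropy of a)
print(mutual_info_score(a, independent))  # ~0.0
```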

vkmvkmvkmvkm commented 10 years ago

Hi @jostheim,

It would help us to get a better sense of your use cases.

Is multiclass classification with e.g. 0-1 loss on ~100D fully-observed data a key use case for you? If yes, is accuracy your driving consideration? Do you need (or even benefit from) calibrated uncertainty? Are there other similar classification/regression tasks that are important? Do you often have specialized losses that are worth taking into account?

We want BayesDB to make it easy for people focused on classic pattern recognition problems to deploy best-in-class ensemble methods, but without having to deal with missing data, feature selection, etc. For example, something like this:

UPDATE SCHEMA FOR games ENABLE PREDICTION_TARGET(home_team_won)

CREATE PREDICTOR FOR home_team_won USING RANDOM FOREST WITH SIGNALS [ESTIMATE COLUMNS WHERE DEPENDENCE PROBABILITY WITH col > 0.2 LIMIT 20]

PREDICT home_team_won GIVEN home_team_budget > 50 AND ... (where INFER ... WITH CONFIDENCE 0 could be used to fill in any missing signals)

This is of course a very different problem than the one solved by INFER.

It also turns out that if we knew a given column was a prediction target, we could boost CrossCat's predictive accuracy (at the cost of enabling the user to overfit, should they insist that all discrete columns are prediction targets).

We haven't gone down any of these roads yet. If you think they might be useful, or especially if you think they'd be useless, it'd be good to know about it. What do you think?

Vikash
