probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0
917 stars 64 forks

Row-by-row model estimator for conditional probability of cell taking arbitrary value #267

Open marcoct opened 8 years ago

marcoct commented 8 years ago

In looking through the satellites demo, I would like to find satellites that are likely to be reconnaissance satellites that weren't labeled as such. My first attempt at this was:

estimate name, purpose, probability of purpose = "Reconnaissance" as prob_recon from satellites_cc

However, probability of purpose = "Reconnaissance" is a constant, because it estimates the probability under the distribution over a hypothetical next row. A version of this that is indexed by observed row, and that estimates the probability under the conditional distribution of the cell given all of the other observed data in that row, would be helpful. Presumably this would have the form predictive probability of <column> = <value>, for consistency with predictive probability of <column>.

Interestingly, predictive probability of <column> = <value> gives the same constant result as probability of <column> = <value>. Is this a bug?
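To make the distinction concrete, here is a minimal Python sketch on a toy two-cluster mixture model. It is not bayeslite code: the model, the orbit column, the row names, and all the numbers are invented for illustration. It only shows why the fresh-row probability is the same for every row while the row-conditional probability varies with the other cells in the row.

```python
# Toy mixture model, invented for illustration (not bayeslite's model).
# Each cluster has a weight and independent categorical distributions
# over two columns: purpose and orbit.
CLUSTERS = [
    {"weight": 0.7,
     "purpose": {"Communications": 0.9, "Reconnaissance": 0.1},
     "orbit": {"GEO": 0.8, "LEO": 0.2}},
    {"weight": 0.3,
     "purpose": {"Communications": 0.2, "Reconnaissance": 0.8},
     "orbit": {"GEO": 0.1, "LEO": 0.9}},
]


def marginal_probability(value):
    """P(purpose = value) for a hypothetical fresh row; identical for every row."""
    return sum(c["weight"] * c["purpose"][value] for c in CLUSTERS)


def row_conditional_probability(value, orbit):
    """P(purpose = value | orbit) for a row whose orbit cell is observed."""
    joint = sum(c["weight"] * c["purpose"][value] * c["orbit"][orbit]
                for c in CLUSTERS)
    evidence = sum(c["weight"] * c["orbit"][orbit] for c in CLUSTERS)
    return joint / evidence


# Hypothetical rows: the first number printed is constant, the second is not.
rows = [{"name": "sat_a", "orbit": "GEO"}, {"name": "sat_b", "orbit": "LEO"}]
for row in rows:
    print(row["name"],
          round(marginal_probability("Reconnaissance"), 3),
          round(row_conditional_probability("Reconnaissance", row["orbit"]), 3))
```

The second quantity is the one I would like to be able to estimate per row.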

axch commented 8 years ago

Hm. The existing metamodel interface exposes two functions, which correspond to the two things you didn't want: column_value_probability assumes a fresh row, and row_column_predictive_probability evaluates the probability of the value currently there, not a hypothetical one. I am in the process of refactoring that interface to permit a generic logpdf_joint method that would enable your query, but some change in the surface language would still be required to enable you to ask that question in a query.
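As a rough sketch of why a joint log-density is enough to answer the question: the conditional is the log joint over the target and the constraints minus the log joint over the constraints alone. The logpdf_joint below is a toy stand-in over a hand-written table, not the actual metamodel method or its signature; logpdf_conditional, the orbit column, and the numbers are likewise assumptions made for illustration.

```python
import math

# Toy joint distribution over (purpose, orbit), invented for illustration.
JOINT = {
    ("Communications", "GEO"): 0.51,
    ("Communications", "LEO"): 0.18,
    ("Reconnaissance", "GEO"): 0.08,
    ("Reconnaissance", "LEO"): 0.23,
}
COLUMNS = ("purpose", "orbit")


def logpdf_joint(assignment):
    """Log probability of a partial assignment {column: value}.

    Columns missing from the assignment are marginalized out.
    Hypothetical stand-in; raises ValueError on zero-probability assignments.
    """
    total = sum(
        p for cell, p in JOINT.items()
        if all(cell[COLUMNS.index(col)] == val
               for col, val in assignment.items())
    )
    return math.log(total)


def logpdf_conditional(targets, constraints):
    """log p(targets | constraints) = log p(targets, constraints) - log p(constraints)."""
    both = dict(constraints)
    both.update(targets)
    return logpdf_joint(both) - logpdf_joint(constraints)


# Conditioning on a row's observed cells gives the per-row probability;
# empty constraints recover the fresh-row probability.
p_row = math.exp(logpdf_conditional({"purpose": "Reconnaissance"}, {"orbit": "LEO"}))
p_fresh = math.exp(logpdf_conditional({"purpose": "Reconnaissance"}, {}))
print(round(p_row, 3), round(p_fresh, 3))
```

With the constraints set to the other observed cells of an existing row, this is the per-row quantity requested; the open question is how to spell that in the query language.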

marcoct commented 8 years ago

I think having access to the full capability of your logpdf_joint in BQL would feel empowering for me as a user: it would give me the interface I (personally) am drawn to, which is the ability to sample from and evaluate probabilities under arbitrary conditional and marginal distributions derived from the full joint over the table and hypothetical rows. However, I haven't explored trying to compose these outputs into more complex objects using the tabular format.

For example, in the satellites demo, one problem I was hoping to tackle was creating a confusion matrix for a categorical column (Purpose) in which I sought to predict the Purpose from the other columns.
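A minimal sketch of the tabulation step, in plain Python and independent of BQL. The label lists are made up; in practice the predicted list would hold the most probable Purpose for each row given that row's other columns, and the function name confusion_matrix is my own.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs across rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


# Hypothetical labels, one entry per row of the table.
actual = ["Reconnaissance", "Communications", "Communications", "Reconnaissance"]
predicted = ["Reconnaissance", "Communications", "Reconnaissance", "Communications"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# One printed line per actual label; counts are ordered by predicted label.
for a in labels:
    print(a, [matrix[(a, p)] for p in labels])
```

The off-diagonal cell for rows predicted as Reconnaissance but labeled otherwise is where the unlabeled reconnaissance candidates from my original question would show up.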

gregory-marton commented 8 years ago

@axch, what's the status of that refactor? Has this been enabled by the committed foreign predictor interface? If so, what would be the associated language change?

gregory-marton commented 8 years ago

Also, how is this related to #263? Would that want to use the same hypothetical?

axch commented 8 years ago

Not by the foreign predictor interface per se, but by changes to the metamodel. The desired metamodel changes are complete; the only thing standing in the way of this now is designing and implementing the surface syntax for it. If you weren't about to disapparate, I would suggest that implementing it could make a nice intro-to-the-guts project for you, @gregory-marton. Not sure how to make the decision on what the syntax should actually be, though.