probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

Estimate conditional dependence probabilities #239

Open marcoct opened 8 years ago

marcoct commented 8 years ago

When apogee_km is 40000 (or more generally between 30000 and 50000), the distribution of perigee_km is bimodal:

q('''CREATE TEMP TABLE apsim AS
SIMULATE apogee_km, perigee_km FROM satellites_cc LIMIT 10000;''')
_ = bdbcontrib.pairplot(satellites_bdb, 'select * from apsim where apogee_km > 30000 AND apogee_km < 50000')

[Image: pair plot of apogee_km and perigee_km for simulated rows with 30000 < apogee_km < 50000]

I would like to know what other variables might help explain this bimodality. I can run this query:

q('''ESTIMATE *, DEPENDENCE PROBABILITY WITH perigee_km as dep
FROM COLUMNS OF satellites_cc
ORDER BY dep DESC;''')

But I cannot run a conditional version with GIVEN apogee_km = 40000 as the condition, e.g.:

q('''ESTIMATE *, DEPENDENCE PROBABILITY WITH perigee_km GIVEN apogee_km = 40000 as dep
FROM COLUMNS OF satellites_cc
ORDER BY dep DESC;''')

The WHERE expression within ESTIMATE is unrelated.

fsaad commented 8 years ago

Very nice observation. This feature is part of extending GPMs to a more flexible interface, which will bring changes to BQL. The composer code has a function, conditional_mutual_information, that can be invoked through its API but is not accessible through BQL.

Implementation-wise, there are difficulties. Even once we have conditional mutual information, it is not entirely clear whether we can keep using crosscat's definition of independence (columns in different views). In particular, independence is not closed under conditioning, so marginal dependence tells us nothing about conditional dependence, and vice versa.
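To make the point about conditioning concrete, here is a small sketch using the standard XOR example, which is not taken from bayeslite or crosscat. Two independent fair coins x and y have zero mutual information, but once z = x XOR y is observed they become fully dependent. The helper names (`marginal`, `condition`, `mutual_information`) are hypothetical and exist only for this illustration.

```python
import itertools
import math

# Joint distribution over (x, y, z): x and y are independent fair coins, z = x XOR y.
joint = {(x, y, x ^ y): 0.25 for x, y in itertools.product((0, 1), repeat=2)}


def marginal(dist, keep):
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = {}
    for outcome, p in dist.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def condition(dist, index, value):
    """Condition on outcome[index] == value, drop that coordinate, renormalize."""
    kept = {o[:index] + o[index + 1:]: p
            for o, p in dist.items() if o[index] == value}
    total = sum(kept.values())
    return {o: p / total for o, p in kept.items()}


def mutual_information(pxy):
    """Exact mutual information (in nats) of a two-variable distribution."""
    px = marginal(pxy, [0])
    py = marginal(pxy, [1])
    return sum(p * math.log(p / (px[(x,)] * py[(y,)]))
               for (x, y), p in pxy.items() if p > 0)


mi_marginal = mutual_information(marginal(joint, [0, 1]))  # x, y alone
mi_given_z0 = mutual_information(condition(joint, 2, 0))   # x, y given z = 0
print(mi_marginal, mi_given_z0)
```

The marginal mutual information is zero while the conditional one is log 2, so a dependence test that only looks at the marginal structure would miss this relationship entirely.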

We resort to simple Monte Carlo estimation of the mutual information, using the appropriate calls to a general simulate and logpdf.
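As a rough sketch of what such a Monte Carlo estimator could look like, the snippet below estimates I(x; y | z) by sampling (x, y) given z and averaging log p(x, y | z) - log p(x | z) - log p(y | z). Everything here is an assumption made for illustration: `ToyModel`, its `simulate` and `logpdf` signatures, and the Gaussian structure are hypothetical stand-ins, not the composer's or bayeslite's actual API. The toy model is chosen so that the true value, 0.5 * log(2), is known in closed form.

```python
import math
import random


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


class ToyModel:
    """Hypothetical stand-in for a GPM: x | z ~ N(z, 1), y | x, z ~ N(x + z, 1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def simulate(self, z):
        """Draw one joint sample (x, y) conditioned on z."""
        x = self.rng.gauss(z, 1.0)
        y = self.rng.gauss(x + z, 1.0)
        return x, y

    def logpdf(self, z, x=None, y=None):
        """Log density of whichever of x, y are supplied, conditioned on z."""
        if x is not None and y is not None:
            return normal_logpdf(x, z, 1.0) + normal_logpdf(y, x + z, 1.0)
        if x is not None:
            return normal_logpdf(x, z, 1.0)
        if y is not None:
            # Marginalizing x out: y | z has mean 2z and variance 1 + 1.
            return normal_logpdf(y, 2.0 * z, 2.0)
        raise ValueError("supply x, y, or both")


def conditional_mutual_information(model, z, n_samples=20000):
    """Simple Monte Carlo estimate of I(x; y | z) in nats."""
    total = 0.0
    for _ in range(n_samples):
        x, y = model.simulate(z)
        total += (model.logpdf(z, x=x, y=y)
                  - model.logpdf(z, x=x)
                  - model.logpdf(z, y=y))
    return total / n_samples


estimate = conditional_mutual_information(ToyModel(seed=0), z=1.0)
exact = 0.5 * math.log(2.0)
print(estimate, exact)
```

The estimator only needs joint samples and the three log densities, which is why a general simulate and logpdf are enough. Its variance depends on the model, so the sample count used here is not a recommendation.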

riastradh-probcomp commented 8 years ago

See also #79.