probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

One can ESTIMATE but not SIMULATE mutual information with the loom backend. #622

Open Schaechtle opened 6 years ago

Schaechtle commented 6 years ago

Note that this issue should probably have gone to our fork of loom; for some reason I don't have privileges to open issues there.

The problem: you can estimate mutual information with loom, but you cannot simulate it, because the loom backend already aggregates the MI values by taking their mean: https://github.com/probcomp/loom/blob/32227b125d45f1435ff6e6f05df76b5161158bcf/loom/query.py#L281

This should be easy to fix: just stop taking the mean in the lines linked above.

fsaad commented 6 years ago

The lines quoted above:

        mi = entropys[feature_set1].mean \
            + entropys[feature_set2].mean \
            - entropys[feature_union].mean

are accessing an attribute of the Estimate namedtuple called mean, as opposed to invoking a method .mean() that takes the mean across an array.
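To make the distinction concrete, here is a minimal sketch of attribute access on a namedtuple. Only the `mean` attribute is confirmed by the lines quoted above; the `variance` field, the feature-set keys, and all numbers are assumptions made up for illustration.

```python
from collections import namedtuple

# Assumption: only `mean` is confirmed by the quoted code; `variance` is
# included here purely for illustration.
Estimate = namedtuple('Estimate', ['mean', 'variance'])

# Hypothetical feature sets and entropy estimates (made-up numbers).
feature_set1 = frozenset(['a'])
feature_set2 = frozenset(['b'])
feature_union = feature_set1 | feature_set2

entropys = {
    feature_set1: Estimate(mean=1.2, variance=0.01),
    feature_set2: Estimate(mean=1.0, variance=0.02),
    feature_union: Estimate(mean=1.9, variance=0.03),
}

# `.mean` is a stored float field, not a call that reduces an array.
mi = entropys[feature_set1].mean \
    + entropys[feature_set2].mean \
    - entropys[feature_union].mean

print(type(entropys[feature_set1].mean).__name__)  # the field is a plain float
print(mi)
```

Each `Estimate` already holds a single scalar by the time this expression runs, so dropping `.mean` here would not recover a list of values.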

Note that the mean in this code is the mean over the Monte Carlo samples of simulate/logpdf used to estimate the entropy, not the mean over a list of entropy values (one entropy value per CrossCat structure in the ensemble). Unfortunately, the latter quantity does not even appear to be directly exposed via the Loom C++ API.
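The two levels of averaging can be illustrated with a toy example. This is only a sketch: it uses one-dimensional Gaussians as stand-ins for the models in an ensemble, and the function name and `sigmas` list are hypothetical. It does not reflect how Loom combines models internally.

```python
import math
import random


def gaussian_entropy_mc(sigma, n_samples, rng):
    """Monte Carlo entropy estimate: mean of -log p(x) over x ~ N(0, sigma^2)."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        logpdf = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                  - x ** 2 / (2 * sigma ** 2))
        total -= logpdf
    # Inner mean: over Monte Carlo samples, for ONE model.
    return total / n_samples


rng = random.Random(0)
sigmas = [0.5, 1.0, 2.0]  # hypothetical ensemble of three toy models

# One entropy estimate per model: the kind of list SIMULATE would need.
per_model = [gaussian_entropy_mc(s, 2000, rng) for s in sigmas]

# Outer mean: over models. This collapses the list into a single number.
aggregated = sum(per_model) / len(per_model)

print(per_model)
print(aggregated)
```

The inner mean produces one number per model; only if those per-model numbers are kept as a list is there a distribution to simulate from.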

After some investigation, it appears that exposing the distribution of entropy (or probabilities, or samples) across the ensemble would need an alternative implementation of the QueryServer class defined in query_server.hpp and implemented in query_server.cpp, which internally aggregates over all cross_cats in the Loom configuration directory. There is probably not much conceptual difficulty in writing a QueryServer that exposes lists of results instead of aggregates, but it would take a non-trivial amount of work (including new protocol buffer message definitions for Query in schema.proto, and of course getting the whole shebang to build).

fsaad commented 5 years ago

@Schaechtle Any updates on whether this has been implemented, or whether it still needs to be?