probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

per-model subsampling #116

Open riastradh-probcomp opened 9 years ago

riastradh-probcomp commented 9 years ago

Subsampling is currently done in Crosscat generator-wide: every model is trained on the same subset of rows. Each model should instead be trained on its own subset of rows, and those subsets may or may not overlap.
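To make the proposal concrete, here is a minimal sketch of drawing one subsample per model rather than one per generator. The helper `per_model_subsamples` and its parameters are hypothetical and are not part of bayeslite or Crosscat; it only illustrates the idea of independent draws, which may overlap.

```python
import random

def per_model_subsamples(row_ids, n_models, subsample_size, seed=0):
    """Draw an independent random subset of rows for each model.

    Hypothetical helper for illustration only; not a bayeslite API.
    Returns {modelno: sorted list of table row ids}.
    """
    rng = random.Random(seed)
    row_ids = list(row_ids)
    # Never ask for more rows than the table has.
    k = min(subsample_size, len(row_ids))
    # One independent draw per model, so the subsets may overlap.
    return {modelno: sorted(rng.sample(row_ids, k))
            for modelno in range(n_models)}

subsamples = per_model_subsamples(range(1, 101), n_models=4, subsample_size=25)
```

Each model gets 25 of the 100 rows here. Making the subsets disjoint instead would mean partitioning a single shuffled list of row ids.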

May require changes to Crosscat, may not -- unclear.

riastradh-probcomp commented 9 years ago

This will require changes to the Crosscat database schema, since the bayesdb_crosscat_subsample table does not mention models.
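As a rough illustration of what "mentioning models" could look like, the sketch below keys a subsample mapping by model number. The table `subsample_per_model` and all of its columns are hypothetical; this is not the actual definition of bayesdb_crosscat_subsample, only one possible shape for a per-model mapping from table rows to model-local row ids.

```python
import sqlite3

# Hypothetical schema for illustration; NOT the real bayeslite table.
SCHEMA = """
CREATE TABLE subsample_per_model (
    generator_id INTEGER NOT NULL,
    modelno      INTEGER NOT NULL,
    table_rowid  INTEGER NOT NULL,
    model_row_id INTEGER NOT NULL,
    PRIMARY KEY (generator_id, modelno, table_rowid),
    UNIQUE (generator_id, modelno, model_row_id)
)
"""

def record_subsample(db, generator_id, modelno, table_rowids):
    """Store one model's subsample with contiguous model-local row ids."""
    rows = [(generator_id, modelno, rowid, i)
            for i, rowid in enumerate(sorted(table_rowids))]
    db.executemany(
        "INSERT INTO subsample_per_model VALUES (?, ?, ?, ?)", rows)

def lookup_model_row_id(db, generator_id, modelno, table_rowid):
    """Return the model-local row id, or None if the model lacks the row."""
    cursor = db.execute(
        "SELECT model_row_id FROM subsample_per_model"
        " WHERE generator_id = ? AND modelno = ? AND table_rowid = ?",
        (generator_id, modelno, table_rowid))
    row = cursor.fetchone()
    return None if row is None else row[0]

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
record_subsample(db, 1, 0, [10, 20, 30])
record_subsample(db, 1, 1, [20, 40])
```

With this layout, table row 20 maps to a different model-local row id in each model, and a row missing from a model's subsample simply has no entry for that model.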

fsaad commented 9 years ago

I think the changes to Crosscat will be mostly in the multistate functions. If the subsets are not overlapping, I believe simple heuristics can help us (for example, for an observed row, delegate to the model that was responsible for that row). Hypothetical rows could aggregate across one hypothetical from each model, as is currently done.
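A minimal sketch of the delegation heuristic for the non-overlapping case, assuming the per-model subsamples are available as a dict from model number to row ids. The function name `build_owner_index` is hypothetical and does not correspond to anything in Crosscat's multistate functions.

```python
def build_owner_index(subsamples):
    """Map each observed row to the single model that was trained on it.

    subsamples: {modelno: iterable of table row ids}, assumed disjoint.
    Hypothetical helper for illustration only.
    """
    owner = {}
    for modelno, rows in subsamples.items():
        for rowid in rows:
            if rowid in owner:
                # Delegation is only well defined when subsets are disjoint.
                raise ValueError(
                    "row %r is in models %r and %r"
                    % (rowid, owner[rowid], modelno))
            owner[rowid] = modelno
    return owner

owner = build_owner_index({0: [1, 2], 1: [3]})
```

A query about observed row 3 would then be sent to model `owner[3]`. The `ValueError` marks exactly the point where the heuristic stops applying, namely overlapping subsamples.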

However, when the subsamples overlap (which is the ideal case), the problem becomes harder to implement. The simplest thing would be to leave the kernel sweeps in posterior inference untouched and figure out how (or whether) the API-facing functions in sample_utils are affected.

riastradh-probcomp commented 9 years ago

I'm not sure we want to delegate to specific models -- it seems more plausible to me that we still want to average over all models. But I don't know for sure.

The main difficulty I expect is indexing: each Crosscat model has a contiguous assignment of row ids to the rows it models; any unmodelled rows do not have row ids. The Crosscat operations that compute averages of a single query over multiple models do not let the query vary from model to model, and since the row ids are fixed in the query, we can't use one query to ask about the same row as represented differently -- perhaps not represented at all -- from one model to the next.

One way to address this would be to push that averaging from Crosscat proper into bayeslite's crosscat.py.
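A minimal sketch of what averaging outside Crosscat could look like, assuming each model carries its own mapping from table row ids to its contiguous model-local row ids. Everything here is hypothetical (`average_over_models`, the `rowid_map` and `state` keys, and `query_fn`); in particular, skipping models that do not contain the row is only one possible policy, and treating the row as hypothetical in those models would be another.

```python
def average_over_models(models, table_rowid, query_fn):
    """Average a per-row quantity over the models that contain the row.

    models: list of dicts with hypothetical keys
        'rowid_map': {table row id -> model-local row id}
        'state':     opaque per-model state handed to query_fn
    query_fn(state, model_row_id) -> float
    Returns None if no model contains the row.
    """
    values = []
    for model in models:
        # The same table row can have a different id in each model.
        model_row_id = model['rowid_map'].get(table_rowid)
        if model_row_id is None:
            # Assumption: models that never saw the row are skipped.
            continue
        values.append(query_fn(model['state'], model_row_id))
    if not values:
        return None
    return sum(values) / len(values)

# Toy models: 'state' is just a list indexed by model-local row id.
toy_models = [
    {'rowid_map': {10: 0, 20: 1}, 'state': [1.0, 3.0]},
    {'rowid_map': {20: 0, 30: 1}, 'state': [5.0, 7.0]},
]
toy_query = lambda state, model_row_id: state[model_row_id]
```

Because the row id is translated per model before each query, the query itself can vary from model to model, which is what the single multi-model Crosscat call does not allow.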