Open riastradh-probcomp opened 9 years ago
This will require changes to the Crosscat database schema, since the bayesdb_crosscat_subsample table does not mention models.
I think the changes to crosscat will be mostly in the multistate functions. If the subsets do not overlap, I believe simple heuristics can help us (for example, delegating an observed row to the model that was trained on it). Hypothetical rows could aggregate across one hypothetical from each model, as is currently done.
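To make the delegation heuristic concrete, here is a minimal sketch for the non-overlapping case. The function names and the `subsamples` dict (model number to table rowids) are hypothetical and exist only for illustration; nothing here is part of Crosscat or bayeslite.

```python
def build_row_owner(subsamples):
    """Map each table rowid to the single model trained on it.

    subsamples: hypothetical dict of model number -> iterable of table rowids.
    Raises ValueError if two models share a row, because delegation is only
    well defined when the subsamples are disjoint.
    """
    owner = {}
    for modelno, rowids in subsamples.items():
        for rowid in rowids:
            if rowid in owner:
                raise ValueError(
                    "row %d is in models %d and %d"
                    % (rowid, owner[rowid], modelno))
            owner[rowid] = modelno
    return owner


def models_for_row(owner, all_modelnos, rowid):
    """Pick the models that should answer a query about rowid.

    A row that some model was trained on is delegated to that model alone.
    Any other row (hypothetical, or simply unmodelled) goes to every model,
    so the caller can aggregate across them.
    """
    if rowid in owner:
        return [owner[rowid]]
    return sorted(all_modelnos)
```

For example, with `owner = build_row_owner({0: [1, 2], 1: [3]})`, a query about row 3 would go only to model 1, while a query about an unmodelled row would go to both models.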
However, when the subsamples overlap (which is the ideal case), the problem becomes more difficult to implement in the code. The simplest thing would be to leave the kernel sweeps in posterior inference untouched, and figure out how (or if) the API-facing functions in sample_utils are affected.
I'm not sure we want to delegate to specific models -- it seems more plausible to me that we still want to average over all models. But I don't know for sure.
The main difficulty I expect is indexing: each Crosscat model has a contiguous assignment of row ids to the rows it models; any unmodelled rows do not have row ids. The Crosscat operations that compute averages of a single query over multiple models do not let the query vary from model to model, and since the row ids are fixed in the query, we can't use one query to ask about the same row as represented differently -- perhaps not represented at all -- from one model to the next.
One way to address this would be to push that averaging from Crosscat proper into bayeslite's crosscat.py.
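A rough sketch of what caller-side averaging might look like, assuming each model numbers its rows contiguously from 0 in sorted table-rowid order. That ordering, the `subsamples` dict, and the `query_one_model` callback are all assumptions made for illustration, not existing Crosscat or bayeslite APIs.

```python
def local_row_ids(subsample):
    """Map table rowids to the contiguous 0-based row ids of one model.

    Assumes (hypothetically) that a model numbers its rows in sorted
    table-rowid order.
    """
    return {rowid: i for i, rowid in enumerate(sorted(subsample))}


def average_over_models(subsamples, table_rowid, query_one_model):
    """Average a per-model query about one table row.

    subsamples: hypothetical dict of model number -> set of table rowids.
    query_one_model(modelno, local_rowid): hypothetical callback returning
    a float. local_rowid is None when that model does not represent the
    row, so the callback can treat the row as hypothetical.
    """
    results = []
    for modelno in sorted(subsamples):
        # Translate the table rowid separately for each model, since the
        # same row may have a different local id, or none, in each one.
        local = local_row_ids(subsamples[modelno]).get(table_rowid)
        results.append(query_one_model(modelno, local))
    return sum(results) / len(results)
```

The point of the sketch is that the query is rebuilt once per model, so the row id is free to vary from model to model. An arithmetic mean is only appropriate for quantities such as probabilities; log densities would need a log-sum-exp instead.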
Subsampling is currently done generator-wide in Crosscat -- every model is trained on the same subset of rows. The models should be trained on varying subsets -- perhaps overlapping, perhaps not -- of rows.
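As an illustration of what per-model subsampling could mean, the following sketch draws one subset of rowids per model, either independently (so subsets may overlap) or as disjoint slices of one shuffle. The function and its parameters are hypothetical; this is not how bayeslite currently populates its subsample table.

```python
import random


def draw_subsamples(rowids, n_models, size, seed=0, overlapping=True):
    """Return a dict of model number -> sorted list of sampled rowids."""
    rng = random.Random(seed)
    rowids = sorted(rowids)
    if overlapping:
        # Each model samples independently, so rows may be shared.
        return {m: sorted(rng.sample(rowids, size)) for m in range(n_models)}
    if n_models * size > len(rowids):
        raise ValueError("not enough rows for disjoint subsamples")
    # Disjoint case: shuffle once, then hand each model its own slice.
    shuffled = rowids[:]
    rng.shuffle(shuffled)
    return {m: sorted(shuffled[m * size:(m + 1) * size])
            for m in range(n_models)}
```

Recording the result would need a model column alongside the rowid, which is the schema gap noted for `bayesdb_crosscat_subsample`.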
May require changes to Crosscat, may not -- unclear.