The composer uses generic, simple Monte Carlo estimates (likelihood weighting) of the various information theoretic quantities required to implement BQL. The advantage of this approach is that the composer can answer ad-hoc queries with arbitrary target and constrained nodes in the DAG without knowing the internals of its constituent GPMs. The downside is that some implementations are slow. This issue outlines key concerns on a method-by-method basis, with approximate complexity. There will have to be design decisions before releasing the code into the wild.
Currently the composer takes in an n_samples parameter to control the accuracy/time tradeoff of each estimate. A future interface will make each query customizable through the API or BQL.
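As a rough sketch of the estimation scheme (not the composer's actual code): likelihood weighting samples each unconstrained node given its parents and weights the draw by the density of the evidence nodes. The simulate_node and logpdf_node hooks below are hypothetical stand-ins for whichever constituent GPM owns a node.

```python
def likelihood_weighted_samples(nodes_in_order, simulate_node, logpdf_node,
                                targets, evidence, n_samples=100):
    # Each sample: draw unconstrained nodes given their parents, and weight
    # the draw by the log density of the evidence nodes under their parents.
    samples = []
    for _ in range(n_samples):
        draw, log_weight = {}, 0.0
        for node in nodes_in_order:            # parents before children
            if node in evidence:
                log_weight += logpdf_node(node, evidence[node], draw)
                draw[node] = evidence[node]
            else:
                draw[node] = simulate_node(node, draw)
        samples.append((log_weight, dict((t, draw[t]) for t in targets)))
    return samples
```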
register
No major concerns.
create_generator
No major concerns. One topological sort of the DAG is performed using an adjacency list representation, roughly O(nm) ~ O(n^3) for a dense graph, but hardly ever a problem unless one has an unusually large number of FPs.
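For reference, a standard Kahn's-algorithm sketch of the sort (not the composer's implementation); the adjacency list maps each node to its children:

```python
from collections import deque

def topological_sort(adjacency):
    # adjacency: {node: [children]}; returns nodes with parents before children.
    indegree = {node: 0 for node in adjacency}
    for children in adjacency.values():
        for child in children:
            indegree[child] = indegree.get(child, 0) + 1
    queue = deque(node for node, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in adjacency.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(indegree):
        raise ValueError('cycle detected among foreign predictor columns')
    return order
```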
drop_generator
No major concerns. For large tables, dropping the internal crosscat metamodel has empirically been shown to take non-negligible time, which the composer cannot change.
initialize_models
Runs initialize for crosscat (can be slow for large datasets).
Runs create and serialize for each foreign predictor (scales with the train time of the FP), then inserts the binary into the SQL database.
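A hypothetical sketch of the per-FP step; the table name, column layout, and the create/serialize signatures are assumptions for illustration, not the composer's actual schema:

```python
def initialize_foreign_predictor(bdb, generator_id, colno, fp_class,
                                 dataframe, targets, conditions):
    # Training dominates runtime; serialization and the insert are cheap.
    predictor = fp_class.create(dataframe, targets, conditions)
    binary = predictor.serialize()
    bdb.sql_execute('''
        INSERT INTO composer_fp_binaries (generator_id, colno, binary)
            VALUES (?, ?, ?)
    ''', (generator_id, colno, binary))
    return predictor
```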
drop_models
TODO.
analyze_models
No joint inference, just crosscat analysis.
column_dependence_probability
Simple graph walk in the DAG, roughly O(E). Currently we don't cache intermediate results in the recursion -- caching might be necessary for a large number of columns.
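One way the caching could look is a memoized ancestor walk; the parents mapping and the dependence rule below are illustrative assumptions, not the composer's exact logic:

```python
def make_ancestor_walk(parents):
    # parents: {column: [columns it directly conditions on]}.
    cache = {}

    def ancestors(col):
        if col not in cache:
            result = set(parents.get(col, ()))
            for p in parents.get(col, ()):
                result |= ancestors(p)
            cache[col] = result
        return cache[col]

    def connected_in_dag(col0, col1):
        # Trivially dependent if one is an ancestor of the other or they
        # share an ancestor; otherwise fall back to crosscat (not shown).
        return bool((ancestors(col0) | {col0}) & (ancestors(col1) | {col1}))

    return connected_in_dag
```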
conditional_mutual_information
Super expensive. For n_samples simulated values, we need to invoke _weighted_samples roughly n_samples^2 times -- the weighted sampler is approximate, and we need n_samples weighted samples to get one approximate sample from the posterior. We then invoke _joint_logpdf four times per sample.
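A sketch of one common Monte Carlo estimator, marginalizing the conditioning column by simulation; simulate and joint_logpdf are hypothetical stand-ins for the composer's sampler and _joint_logpdf, and the four density evaluations per sample match the count above:

```python
def estimate_cmi(simulate, joint_logpdf, X, Y, Z, n_samples=100):
    # I(X;Y | Z) ~= (1/N) * sum over simulated (x, y, z) of
    #   log p(x,y,z) - log p(x,z) - log p(y,z) + log p(z)
    total = 0.0
    for x, y, z in simulate([X, Y, Z], {}, n_samples):
        total += (joint_logpdf({X: x, Y: y, Z: z})
                  - joint_logpdf({X: x, Z: z})
                  - joint_logpdf({Y: y, Z: z})
                  + joint_logpdf({Z: z}))
    return total / n_samples
```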
_joint_logpdf
Super expensive. We need to compute the partition function (the likelihood of the evidence constraints). One possible solution is to kill the computation of the evidence (2x speedup) and only return unnormalized densities for continuous variables, since densities are mostly useful for comparison.
Note that there are no known algorithms for reusing the samples for QY and Y.
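A minimal sketch of the normalized-vs-unnormalized tradeoff, assuming a hypothetical weighted_logpdf estimator; skipping the evidence term is the 2x speedup mentioned above:

```python
def conditional_logpdf(weighted_logpdf, query, evidence, normalize=True):
    # log p(Q=q | Y=y) ~= log p(q, y) - log p(y); each term is itself a
    # likelihood-weighting estimate.
    joint = dict(query)
    joint.update(evidence)
    numerator = weighted_logpdf(joint)
    if not normalize:
        return numerator     # unnormalized: skip the evidence term entirely
    return numerator - weighted_logpdf(evidence)   # the "partition function"
```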
predict_confidence
Might be expensive. For a child node, we need to impute all the missing parents, which for continuous values is typically slow. For predicting a column modeled by a foreign predictor, we need to invoke simulate.
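A sketch of the idea for a categorical FP-modeled column, assuming a hypothetical simulate hook; the "confidence" here is just the empirical frequency of the modal value, one of several reasonable definitions:

```python
def predict_confidence(simulate, column, row_values, parents, n_samples=100):
    # Fill in any missing parents with one simulated draw, then simulate the
    # target column and report the modal value and its empirical frequency.
    constraints = dict(row_values)
    missing = [p for p in parents if p not in constraints]
    if missing:
        imputed = simulate(missing, constraints, 1)[0]  # slow if continuous
        constraints.update(zip(missing, imputed))
    draws = [v[0] for v in simulate([column], constraints, n_samples)]
    mode = max(set(draws), key=draws.count)
    confidence = draws.count(mode) / float(n_samples)
    return mode, confidence
```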
simulate
Expensive. Because the sampler is approximate, we need a large number of weighted samples to return one approximate sample (empirically, one approximate sample requires ~200 weighted samples).
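A sketch of the importance-resampling step, assuming a hypothetical weighted_sampler that returns (log_weight, values) pairs; the ~200 figure is the empirical number quoted above:

```python
import math
import random

def approximate_sample(weighted_sampler, targets, constraints, n_weighted=200):
    # Draw ~200 likelihood-weighted samples and resample one of them with
    # probability proportional to its (exponentiated, stabilized) weight.
    draws = weighted_sampler(targets, constraints, n_weighted)
    max_lw = max(lw for lw, _ in draws)
    weights = [math.exp(lw - max_lw) for lw, _ in draws]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for w, (_, values) in zip(weights, draws):
        acc += w
        if acc >= r:
            return values
    return draws[-1][1]  # numerical edge case
```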
row_similarity
Delegates to crosscat.
row_column_predictive_probability
Delegates to column_value_probability. I have issues with the query, see comment in the code.
(de)serializing foreign predictor binaries
Deserialized FP binaries are cached in memory per-bdb session, rather than loaded from the database on demand for each query. I do not anticipate this caching causing any noticeable overhead.
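A sketch of what the per-session cache could look like; the class and hook names are illustrative, not the composer's actual attributes:

```python
class PredictorCache(object):
    """In-memory, per-bdb-session cache of deserialized foreign predictors."""

    def __init__(self, load_binary, deserialize):
        self._load_binary = load_binary   # e.g. a SELECT against the bdb
        self._deserialize = deserialize   # e.g. the FP class's deserialize
        self._cache = {}

    def get(self, generator_id, colno):
        key = (generator_id, colno)
        if key not in self._cache:
            binary = self._load_binary(generator_id, colno)
            self._cache[key] = self._deserialize(binary)
        return self._cache[key]
```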