The composer uses generic, simple Monte Carlo estimates (likelihood weighting) of the various information theoretic quantities required to implement BQL. The advantage of this approach is that the composer can answer ad-hoc queries with arbitrary target and constrained nodes in the DAG without knowing the internals of its constituent GPMs. The downside is that some implementations are slow. This issue outlines key concerns on a method-by-method basis, with approximate complexity. There will have to be design decisions before releasing the code into the wild.
Currently the composer takes in an n_samples parameter to control the accuracy/time tradeoff of each estimate. A future interface will make each query customizable through the API or BQL.
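As a rough sketch of the estimation scheme (not the composer's actual code): likelihood weighting samples each unconstrained node given its parents and weights the draw by the density of the evidence nodes. The simulate_node and logpdf_node hooks below are hypothetical stand-ins for whichever constituent GPM owns a node.

```python
def likelihood_weighted_samples(nodes_in_order, simulate_node, logpdf_node,
                                targets, evidence, n_samples=100):
    # Each sample: draw unconstrained nodes given their parents, and weight
    # the draw by the log density of the evidence nodes under their parents.
    samples = []
    for _ in range(n_samples):
        draw, log_weight = {}, 0.0
        for node in nodes_in_order:            # parents before children
            if node in evidence:
                log_weight += logpdf_node(node, evidence[node], draw)
                draw[node] = evidence[node]
            else:
                draw[node] = simulate_node(node, draw)
        samples.append((log_weight, dict((t, draw[t]) for t in targets)))
    return samples
```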
register
No major concerns.
create_generator
No major concerns. One topological sort of the DAG is performed using an adjacency list representation, roughly O(nm) ~ O(n^3) for a dense graph, but hardly ever a problem unless one has an unusually large number of FPs.
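For reference, a standard Kahn's-algorithm sketch of the sort (not the composer's implementation); the adjacency list maps each node to its children:

```python
from collections import deque

def topological_sort(adjacency):
    # adjacency: {node: [children]}; returns nodes with parents before children.
    indegree = {node: 0 for node in adjacency}
    for children in adjacency.values():
        for child in children:
            indegree[child] = indegree.get(child, 0) + 1
    queue = deque(node for node, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in adjacency.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(indegree):
        raise ValueError('cycle detected among foreign predictor columns')
    return order
```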
drop_generator
No major concerns. For large tables, dropping the internal crosscat metamodel has empirically been shown to take non-negligible time, which the composer cannot change.
initialize_models
Runs initialize for crosscat (can be slow for large datasets).
Runs create and serialize for each foreign predictor (scales with the train time of the FP), then inserts the binary into the SQL database.
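A hypothetical sketch of the per-FP step; the table name, column layout, and the create/serialize signatures are assumptions for illustration, not the composer's actual schema:

```python
def initialize_foreign_predictor(bdb, generator_id, colno, fp_class,
                                 dataframe, targets, conditions):
    # Training dominates runtime; serialization and the insert are cheap.
    predictor = fp_class.create(dataframe, targets, conditions)
    binary = predictor.serialize()
    bdb.sql_execute('''
        INSERT INTO composer_fp_binaries (generator_id, colno, binary)
            VALUES (?, ?, ?)
    ''', (generator_id, colno, binary))
    return predictor
```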
drop_models
TODO.
analyze_models
No joint inference, just crosscat analysis.
column_dependence_probability
Simple graph walk in the DAG, roughly O(E). Currently we don't cache intermediate results in the recursion -- caching might be necessary for a large number of columns.
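One way the caching could look is a memoized ancestor walk; the parents mapping and the dependence rule below are illustrative assumptions, not the composer's exact logic:

```python
def make_ancestor_walk(parents):
    # parents: {column: [columns it directly conditions on]}.
    cache = {}

    def ancestors(col):
        if col not in cache:
            result = set(parents.get(col, ()))
            for p in parents.get(col, ()):
                result |= ancestors(p)
            cache[col] = result
        return cache[col]

    def connected_in_dag(col0, col1):
        # Trivially dependent if one is an ancestor of the other or they
        # share an ancestor; otherwise fall back to crosscat (not shown).
        return bool((ancestors(col0) | {col0}) & (ancestors(col1) | {col1}))

    return connected_in_dag
```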
conditional_mutual_information
Super expensive. For n_samples simulated values, we need to invoke _weighted_samples roughly n_samples^2 times -- the weighted sampler is approximate, and we need n_samples weighted samples to get one approximate sample from the posterior. We then invoke _joint_logpdf four times per sample.
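A sketch of one common Monte Carlo estimator, marginalizing the conditioning column by simulation; simulate and joint_logpdf are hypothetical stand-ins for the composer's sampler and _joint_logpdf, and the four density evaluations per sample match the count above:

```python
def estimate_cmi(simulate, joint_logpdf, X, Y, Z, n_samples=100):
    # I(X;Y | Z) ~= (1/N) * sum over simulated (x, y, z) of
    #   log p(x,y,z) - log p(x,z) - log p(y,z) + log p(z)
    total = 0.0
    for x, y, z in simulate([X, Y, Z], {}, n_samples):
        total += (joint_logpdf({X: x, Y: y, Z: z})
                  - joint_logpdf({X: x, Z: z})
                  - joint_logpdf({Y: y, Z: z})
                  + joint_logpdf({Z: z}))
    return total / n_samples
```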
_joint_logpdf
Super expensive. We need to compute the partition function (the likelihood of the evidence constraints). One possible solution is to kill the computation of the evidence (2x speedup) and only return unnormalized densities for continuous variables, since densities are mostly useful for comparison.
Note that there are no known algorithms for reusing the samples for QY and Y.
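A minimal sketch of the normalized-vs-unnormalized tradeoff, assuming a hypothetical weighted_logpdf estimator; skipping the evidence term is the 2x speedup mentioned above:

```python
def conditional_logpdf(weighted_logpdf, query, evidence, normalize=True):
    # log p(Q=q | Y=y) ~= log p(q, y) - log p(y); each term is itself a
    # likelihood-weighting estimate.
    joint = dict(query)
    joint.update(evidence)
    numerator = weighted_logpdf(joint)
    if not normalize:
        return numerator     # unnormalized: skip the evidence term entirely
    return numerator - weighted_logpdf(evidence)   # the "partition function"
```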
predict_confidence
Might be expensive. For a child node, we need to impute all the missing parents, which for continuous values is typically slow. For predicting a column modeled by a foreign predictor, we need to invoke simulate.
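A sketch of the idea for a categorical FP-modeled column, assuming a hypothetical simulate hook; the "confidence" here is just the empirical frequency of the modal value, one of several reasonable definitions:

```python
def predict_confidence(simulate, column, row_values, parents, n_samples=100):
    # Fill in any missing parents with one simulated draw, then simulate the
    # target column and report the modal value and its empirical frequency.
    constraints = dict(row_values)
    missing = [p for p in parents if p not in constraints]
    if missing:
        imputed = simulate(missing, constraints, 1)[0]  # slow if continuous
        constraints.update(zip(missing, imputed))
    draws = [v[0] for v in simulate([column], constraints, n_samples)]
    mode = max(set(draws), key=draws.count)
    confidence = draws.count(mode) / float(n_samples)
    return mode, confidence
```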
simulate
Expensive. Because the sampler is approximate, we need a large number of weighted samples to return one approximate sample (empirically, one approximate sample requires ~200 weighted samples).
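A sketch of the importance-resampling step, assuming a hypothetical weighted_sampler that returns (log_weight, values) pairs; the ~200 figure is the empirical number quoted above:

```python
import math
import random

def approximate_sample(weighted_sampler, targets, constraints, n_weighted=200):
    # Draw ~200 likelihood-weighted samples and resample one of them with
    # probability proportional to its (exponentiated, stabilized) weight.
    draws = weighted_sampler(targets, constraints, n_weighted)
    max_lw = max(lw for lw, _ in draws)
    weights = [math.exp(lw - max_lw) for lw, _ in draws]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for w, (_, values) in zip(weights, draws):
        acc += w
        if acc >= r:
            return values
    return draws[-1][1]  # numerical edge case
```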
row_similarity
Delegates to crosscat.
row_column_predictive_probability
Delegates to column_value_probability. I have issues with the query, see comment in the code.
(de)serializing foreign predictor binaries
Deserialized FP binaries are cached in memory per-bdb session, rather than loaded from the database on demand for each query. I do not anticipate this caching causing any noticeable overhead.
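A sketch of what the per-session cache could look like; the class and hook names are illustrative, not the composer's actual attributes:

```python
class PredictorCache(object):
    """In-memory, per-bdb-session cache of deserialized foreign predictors."""

    def __init__(self, load_binary, deserialize):
        self._load_binary = load_binary   # e.g. a SELECT against the bdb
        self._deserialize = deserialize   # e.g. the FP class's deserialize
        self._cache = {}

    def get(self, generator_id, colno):
        key = (generator_id, colno)
        if key not in self._cache:
            binary = self._load_binary(generator_id, colno)
            self._cache[key] = self._deserialize(binary)
        return self._cache[key]
```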