probcomp / cgpm

Library of composable generative population models which serve as the modeling and inference backend of BayesDB.
Apache License 2.0

Feasibility of integrating Tree-Cat? #216

Open fritzo opened 7 years ago

fritzo commented 7 years ago

Hi all, I'm looking for a way to test Tree-Cat, a generalization of Cross-Cat to latent tree models (currently only for categorical and ordinal-as-binomial features). I'm guessing that you've built a suite of datasets and evaluation metrics on top of cgpm, so I thought an easy way to test Tree-Cat would be to support a standard cgpm engine interface in a treecat.cgpm module, and then add a small cgpm.treecat integration in this repo.

  1. Does cgpm have a standard engine interface that I can support?
  2. Have you published cgpm's evaluation suite, or would you be willing to help run the evaluation internally?

Thanks!

fsaad commented 7 years ago

Hi @fritzo, thanks for writing to us about Tree-Cat. I looked through the model and found it very interesting and creative.

Does cgpm have a standard engine interface that I can support?

There is a straightforward path toward integrating Tree-Cat through either the CGpm interface or the IBayesDBMetamodel interface.

Both of these interfaces would allow Tree-Cat to be used as a modeling backend through the Metamodeling Language and queried with the Bayesian Query Language in BayesDB, which is typically how we run model evaluations. Which of the two interfaces makes more sense to implement depends on which features of Tree-Cat we wish to expose to the end user. For example, implementing the CGpm interface would allow us to compose Tree-Cat with other models in the repository, at the expense of some overhead in query runtime; implementing the IBayesDBMetamodel interface makes it easier to optimize the implementations of simulate/logpdf as invoked by BayesDB, at the expense of less flexibility for composing Tree-Cat with other models. We should also consider whether Tree-Cat has any built-in multiprocessing capabilities (which the CGpm integration automatically provides, but the IBayesDBMetamodel integration does not).
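To make the trade-off concrete, here is a minimal sketch of the kind of simulate/logpdf surface a Tree-Cat wrapper would need to expose. The class name, the method signatures, and the add-one smoothing are illustrative assumptions and not the actual cgpm base class; only the method names `simulate` and `logpdf` come from the discussion above.

```python
import math
import random


class ToyCategoricalModel:
    """Hypothetical stand-in for a model wrapper; not the real cgpm API.

    Each output column is an independent categorical with add-one
    (Laplace) smoothing, just enough to exercise the interface.
    """

    def __init__(self, outputs, num_categories, seed=0):
        self.outputs = list(outputs)                 # column ids this model owns
        self.num_categories = dict(num_categories)   # column id -> arity
        self.counts = {c: [0] * self.num_categories[c] for c in self.outputs}
        self.rng = random.Random(seed)

    def incorporate(self, observation):
        """Record one row, given as {column id: category index}."""
        for col, value in observation.items():
            self.counts[col][value] += 1

    def _probs(self, col):
        counts = self.counts[col]
        total = sum(counts) + len(counts)
        return [(n + 1) / total for n in counts]

    def logpdf(self, targets, constraints=None):
        """Log probability of {column: value}. Constraints are ignored
        because the toy columns are independent."""
        return sum(math.log(self._probs(c)[v]) for c, v in targets.items())

    def simulate(self, targets, constraints=None):
        """Draw one value for each requested column id."""
        return {
            c: self.rng.choices(range(self.num_categories[c]),
                                weights=self._probs(c))[0]
            for c in targets
        }


model = ToyCategoricalModel(outputs=[0, 1], num_categories={0: 2, 1: 3})
for row in [{0: 1, 1: 2}, {0: 1, 1: 0}, {0: 0, 1: 2}]:
    model.incorporate(row)
print(model.logpdf({0: 1}))    # log(3/5): two of three rows, plus smoothing
print(model.simulate([0, 1]))
```

A real wrapper would delegate `logpdf` and `simulate` to Tree-Cat's own inference code and would have to honor `constraints`, which this toy ignores.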

Have you published cgpm's evaluation suite, or would you be willing to help running eval internally?

We have run various benchmarks for cgpms, although I think none of our evaluation suites are particularly suited to the nominal/ordinal data that Tree-Cat targets. My sense is that we can benchmark Tree-Cat together by:

  1. Identifying a set of synthetic and/or real-world datasets which are suitable for the Tree-Cat prior;
  2. Specifying a set of queries in the Bayesian Query Language (such as density estimation, conditional simulation, etc), as well as error metrics for each query;
  3. Comparing the predictive performance of Tree-Cat against baselines such as vanilla DPMMs, CrossCat, noisy discriminative models, and/or sum-product networks (most of these baselines are already implemented as cgpms in this repo).

By writing the benchmark suite in BQL, which is model independent, we can logically separate the task of defining the evaluation set from the task of implementing baseline models to run the queries against. Further extensions may include comparing the performance of Tree-Cat and the baselines while varying the amount of model analysis and/or any tunable query parameters.
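The separation described above can be sketched in plain Python, without assuming any BQL syntax: the evaluation set is a list of query records, and any model exposing a `logpdf` callable can be scored against it. The query record layout, the `mean_heldout_logpdf` metric, and both baseline functions are hypothetical, chosen only for illustration.

```python
import math

# Evaluation set: each query asks for the log density of held-out target
# values given observed constraints. Defined once, shared by all models.
QUERIES = [
    {"targets": {"color": "red"}, "constraints": {"shape": "round"}},
    {"targets": {"color": "blue"}, "constraints": {"shape": "square"}},
    {"targets": {"color": "red"}, "constraints": {"shape": "square"}},
]


def uniform_logpdf(targets, constraints):
    """Baseline that ignores constraints; assumes two equally likely colors."""
    return len(targets) * math.log(0.5)


def shape_aware_logpdf(targets, constraints):
    """Baseline with a made-up conditional table p(color | shape)."""
    table = {
        "round": {"red": 0.9, "blue": 0.1},
        "square": {"red": 0.3, "blue": 0.7},
    }
    return math.log(table[constraints["shape"]][targets["color"]])


def mean_heldout_logpdf(logpdf, queries):
    """Metric: average log density over the evaluation set (higher is better)."""
    total = sum(logpdf(q["targets"], q["constraints"]) for q in queries)
    return total / len(queries)


MODELS = [("uniform", uniform_logpdf), ("shape_aware", shape_aware_logpdf)]
scores = {name: mean_heldout_logpdf(fn, QUERIES) for name, fn in MODELS}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Adding Tree-Cat to such a comparison would then mean adding one more entry to `MODELS`, leaving `QUERIES` and the metric untouched.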

Let me know if you have thought about what datasets and queries could be appropriate for benchmarking Tree-Cat.

fritzo commented 7 years ago

Thanks @fsaad for your detailed response! It looks like the CGpm interface will be the easiest for me to integrate with, so I will refactor towards that interface.

Re: multiprocessing, TreeCat achieves efficient querying by batching queries and vectorizing the math internally using numpy. This has made the most sense for my use cases, e.g. cross-validating on an entire dataset, or computing mutual information by evaluating logprob on a batch of samples. Do you have any plans to add a batch query interface to CGpm?
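As a rough illustration of the batching idea (not TreeCat's actual code), the sketch below estimates mutual information between two categorical features by evaluating log probabilities on a whole batch of samples with numpy. The joint table, the `batch_logprob` helper, and the sample size are assumptions made for the example.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary features.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
log_joint = np.log(joint)
log_px = np.log(joint.sum(axis=1))   # marginal over y
log_py = np.log(joint.sum(axis=0))   # marginal over x


def batch_logprob(xs, ys):
    """Return log p(x, y), log p(x), log p(y) for arrays of samples,
    using array indexing instead of a Python loop over rows."""
    return log_joint[xs, ys], log_px[xs], log_py[ys]


# Draw a batch of (x, y) pairs from the joint distribution.
rng = np.random.default_rng(0)
flat = rng.choice(joint.size, size=100_000, p=joint.ravel())
xs, ys = np.unravel_index(flat, joint.shape)

# Monte Carlo estimate: average of log p(x,y) - log p(x) - log p(y).
lp_xy, lp_x, lp_y = batch_logprob(xs, ys)
mi_estimate = np.mean(lp_xy - lp_x - lp_y)

# Exact value for comparison, available here because the table is tiny.
mi_exact = np.sum(joint * (log_joint - log_px[:, None] - log_py[None, :]))
print(mi_estimate, mi_exact)
```

With a per-row interface, the same estimate would need one call per sample, which is where a batch query method would pay off.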

Re: datasets, I am currently testing with two private social services datasets (20K rows, 200 features, categorical and ordinal). I could test other models on them and publish the test results. I think a good public dataset would be a text mining dataset such as the Enron emails (500K rows, 1000s of sparse boolean features). Text mining seems to be the main application for Zhang and Poon's Latent Tree Analysis, a model very similar to TreeCat. I am currently working on an Enron analysis blog post. Could you point me towards any existing model comparison code/notebooks using CGpm, as a starting point for analyzing these datasets in a CGpm-compatible way?