probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

Is there a user-facing way to discover how many models a pre-analyzed file has? #246

Open axch opened 8 years ago

axch commented 8 years ago

- Use case: knowing roughly what kind of result robustness to expect
- Use case: knowing what model ranges to specify in queries

Starting with `select * from bayesdb_generator_model` works, but seems a little internal. Or do we intend to document (parts of) Bayeslite's schema to enable this sort of introspection?
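For concreteness, here is a minimal sketch of the workaround from Python today, assuming the internal `bayesdb_generator_model` table keeps one row per model and has a `generator_id` column (the table name is from the select above; the column name is a guess at the undocumented schema):

```python
# Sketch only: count models per generator by reading bayeslite's
# internal bayesdb_generator_model table directly.  The generator_id
# column name is an assumption about the undocumented schema.
import bayeslite

bdb = bayeslite.bayesdb_open('satellites.bdb')
cursor = bdb.sql_execute('''
    SELECT generator_id, COUNT(*) AS n_models
        FROM bayesdb_generator_model
        GROUP BY generator_id
''')
for generator_id, n_models in cursor:
    print(generator_id, n_models)
```

That works, but it leans entirely on internals, hence this ticket.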

riastradh-probcomp commented 8 years ago

I did at some point intend to document some parts of bayeslite's schema to allow for some introspection.

We could invent a pragma for the purpose too:

```
PRAGMA bayesdb_generator_precision(satellites_cc)
```

('Precision' is the first word that came to mind that might be taken to mean an estimate of the expected error, or an estimate of the population variance, &c.)

tibbetts commented 8 years ago

In a separate conversation, Vikash wanted me, instead of saying "BayesDB says the inferred value is 12 with 70% confidence", to say something like "BayesDB, on a population with 20 observations, 32 models run for 1200 iterations, inferred a value of 12 with 70% confidence." I take it this ticket would give me the metadata to get that phrasing right.

Should charts automatically be tagged with this metadata?
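For example, something like this (a sketch only, not an existing API; the observation count and the `iterations` column of `bayesdb_generator_model` are assumptions about bayeslite's undocumented internal schema):

```python
# Hypothetical helper, not part of bayeslite: assemble the provenance
# string above by reading the internal tables.  The iterations column
# is an assumption about the undocumented schema.
def provenance_caption(bdb, table, generator_id, value, confidence):
    # Number of observations in the underlying table.
    n_rows = next(iter(bdb.sql_execute(
        'SELECT COUNT(*) FROM "%s"' % (table,))))[0]
    # Number of models and (maximum) iteration count for the generator.
    n_models, n_iters = next(iter(bdb.sql_execute('''
        SELECT COUNT(*), MAX(iterations) FROM bayesdb_generator_model
            WHERE generator_id = ?
    ''', (generator_id,))))
    return ('BayesDB, on a population with %d observations, '
            '%d models run for %d iterations, inferred a value of %s '
            'with %d%% confidence.'
            % (n_rows, n_models, n_iters, value,
               int(round(100 * confidence))))
```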

riastradh-probcomp commented 8 years ago

> Should charts automatically be tagged with this metadata?

Yes!

axch commented 8 years ago

One annoyance: nothing enforces that all models are run for the same number of iterations, even though that is the convention. In the interest of maximum disclosure, we would need to invent a scheme that summarizes the amount of analysis done even when it is heterogeneous, and even as more data is streamed in.
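A range-style summary is one option, e.g. (sketch only; assumes the internal `bayesdb_generator_model` table records an `iterations` count per model):

```python
# Sketch: summarize the amount of analysis even when models have been
# run for different numbers of iterations.  Assumes the internal
# bayesdb_generator_model table has a per-model iterations column.
def analysis_summary(bdb, generator_id):
    cursor = bdb.sql_execute('''
        SELECT COUNT(*), MIN(iterations), MAX(iterations)
            FROM bayesdb_generator_model
            WHERE generator_id = ?
    ''', (generator_id,))
    n_models, min_iters, max_iters = next(iter(cursor))
    if min_iters == max_iters:
        return '%d models, %d iterations each' % (n_models, min_iters)
    return ('%d models, between %d and %d iterations each'
            % (n_models, min_iters, max_iters))
```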

riastradh-probcomp commented 8 years ago

We could take the average. (We could also multiply them!)

gregory-marton commented 8 years ago

`recipes.analysis_status()` shows this info in the notebook, by returning a dataframe with, for each iteration count, the number of models that have that count. There is a lower-level function, `per_model_analysis_status()`, which returns a dataframe with each model number and its iteration count; `analysis_status` is just a `.value_counts()` of that.
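Rough usage sketch (`recipes` here stands in for whatever bdbcontrib recipes object the notebook already has in hand; how it is constructed isn't shown in this thread):

```python
# Usage sketch for the functions described above.  `recipes` is an
# existing bdbcontrib recipes object, constructed elsewhere.
def show_analysis_status(recipes):
    # Index: iteration count; values: how many models have that count.
    status = recipes.analysis_status()
    print(status)
    # One row per model number, with that model's iteration count.
    per_model = recipes.per_model_analysis_status()
    print(per_model)
```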

I'm not sure to what extent this counts as "user-facing", because it lives in bdbcontrib rather than bayeslite, is not part of the language, and is still just a Python function.

This also doesn't really address the questions of expected precision or robustness, because of course different numbers of models and iterations will be good enough for different datasets, queries, and requirements of the answer. But that's a little bit of an open problem, isn't it?