Open axch opened 8 years ago
I did at some point intend to document some parts of bayeslite's schema to allow for some introspection.
We could invent a pragma for the purpose too:
PRAGMA bayesdb_generator_precision(satellites_cc)
('Precision' is the first word that came to mind which might be taken to mean an estimate of the expected error, or might be taken to be an estimate of the population variance, &c.)
In a separate conversation, Vikash asked me to stop saying "BayesDB says the inferred value is 12 with 70% confidence" and instead say something like "BayesDB, on a population with 20 observations, 32 models run for 1200 iterations, inferred a value of 12 with 70% confidence." I take it this ticket would give me the metadata to get the phrasing right.
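As a rough illustration of that phrasing, here is a minimal sketch that assembles the sentence once the metadata is in hand. The function name `describe_inference` and its parameters are hypothetical, not part of bayeslite or bdbcontrib, and it assumes every model ran for the same number of iterations.

```python
def describe_inference(value, confidence, n_observations, n_models, n_iterations):
    """Format an inferred value together with the analysis that produced it.

    Hypothetical helper: assumes all models share one iteration count.
    """
    return (
        "BayesDB, on a population with {obs} observations, "
        "{models} models run for {iters} iterations, "
        "inferred a value of {value} with {conf:.0%} confidence."
    ).format(
        obs=n_observations,
        models=n_models,
        iters=n_iterations,
        value=value,
        conf=confidence,
    )


# Numbers taken from the example sentence in this thread.
print(describe_inference(12, 0.70, 20, 32, 1200))
```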
Should charts automatically be tagged with this metadata?
Yes!
One annoyance: nothing enforces that all models are run for the same number of iterations, even though that is the convention. In the interest of maximum disclosure, we would need to invent a scheme that summarizes the amount of analysis done even when it is heterogeneous, including when more data has been streamed in.
We could take the average. (We could also multiply them!)
recipes.analysis_status() shows this info in the notebook: it returns a df listing each iteration count and the number of models that have that count. There is a lower-level function, per_model_analysis_status(), that returns a df with each model number and its iteration count; analysis_status is a .value_counts() over that.
I'm not sure to what extent this counts as "user facing": it lives in bdbcontrib rather than bayeslite and is not part of the language, but it is still just a Python function.
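To make the shape of that summary concrete, here is a stdlib-only sketch of the same idea. It is not the bdbcontrib implementation: the names `iteration_histogram` and `mean_iterations` are hypothetical, a plain dict stands in for the per-model df, and `Counter` plays the role of `.value_counts()`.

```python
from collections import Counter


def iteration_histogram(per_model):
    """Count how many models share each iteration count.

    `per_model` maps model number -> iterations run (hypothetical input shape).
    """
    return Counter(per_model.values())


def mean_iterations(per_model):
    """Average iterations per model, one possible single-number summary."""
    return float(sum(per_model.values())) / len(per_model)


# Made-up, heterogeneous example: one model lags behind the others.
models = {0: 1200, 1: 1200, 2: 800, 3: 1200}

for iterations, n_models in sorted(iteration_histogram(models).items()):
    print("{} model(s) at {} iterations".format(n_models, iterations))
print("mean iterations per model:", mean_iterations(models))
```

The histogram keeps the heterogeneity visible, whereas the mean collapses it; which one belongs in a caption is a judgment call.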
This also doesn't really address the questions of expected precision or robustness, because of course different numbers of models and iterations will be good enough for different datasets, queries, and requirements of the answer. But that's a little bit of an open problem, isn't it?
Use case: knowing roughly what kind of result robustness to expect
Use case: knowing what model ranges to specify in queries
Starting with
select * from bayesdb_generator_model
works but seems a little internal. Or do we intend to document (parts of) Bayeslite's schema to enable this sort of introspection?
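For reference, a sketch of what that kind of introspection could look like from Python with the stdlib `sqlite3` module. The column names `generator_id`, `modelno`, and `iterations` are an assumption about the internal table, not a documented interface, so the example builds a stand-in table in memory rather than opening a real bdb file; `models_per_generator` is a hypothetical helper.

```python
import sqlite3


def models_per_generator(conn):
    """Return (generator_id, n_models, min_iterations, max_iterations) rows.

    Assumes a table shaped like bayesdb_generator_model with the columns
    generator_id, modelno, iterations (an assumption, not a documented API).
    """
    sql = """
        SELECT generator_id, COUNT(*), MIN(iterations), MAX(iterations)
          FROM bayesdb_generator_model
         GROUP BY generator_id
         ORDER BY generator_id
    """
    return conn.execute(sql).fetchall()


# Stand-in database with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayesdb_generator_model ("
    " generator_id INTEGER NOT NULL,"
    " modelno INTEGER NOT NULL,"
    " iterations INTEGER NOT NULL,"
    " PRIMARY KEY (generator_id, modelno))"
)
conn.executemany(
    "INSERT INTO bayesdb_generator_model VALUES (?, ?, ?)",
    [(1, 0, 1200), (1, 1, 1200), (1, 2, 800)],
)
print(models_per_generator(conn))
```

The model count answers the "what model ranges can I specify" use case, and the min/max iterations give a first hint about how much analysis has been done.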