probcomp / bayeslite

BayesDB on SQLite. A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself.
http://probcomp.csail.mit.edu/software/bayesdb
Apache License 2.0

draft a scheme for deciding agreement of statistical type with distributions in some metamodel #460

Open riastradh-probcomp opened 8 years ago

riastradh-probcomp commented 8 years ago

For example, it might reject attempts to model a categorical variable with a normal distribution. This issue will be satisfied not when we have implemented a clear, coherent design that we will stand behind for the rest of time, but when we have a draft of an idea that we can experiment with in practice. The concept of 'statistical type' remains fuzzy.
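As one way to picture such a draft, here is a minimal sketch of an agreement check. Everything in it is hypothetical: the `StatType` names, the `ALLOWED` table, and `check_agreement` are illustrations, not part of bayeslite or of any metamodel's actual interface.

```python
from enum import Enum


class StatType(Enum):
    """Hypothetical statistical types, named for illustration only."""
    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"
    COUNT = "count"
    CYCLIC = "cyclic"


# Assumed table: which statistical types each distribution may model.
# A real metamodel would supply its own table.
ALLOWED = {
    "normal": {StatType.NUMERICAL},
    "categorical": {StatType.CATEGORICAL},
    "poisson": {StatType.COUNT},
    "von_mises": {StatType.CYCLIC},
}


def check_agreement(stattype, distribution):
    """Raise ValueError if the distribution does not agree with the type."""
    if distribution not in ALLOWED:
        raise ValueError("unknown distribution: %r" % (distribution,))
    if stattype not in ALLOWED[distribution]:
        raise ValueError(
            "cannot model a %s variable with a %s distribution"
            % (stattype.value, distribution))


check_agreement(StatType.NUMERICAL, "normal")        # accepted silently
try:
    check_agreement(StatType.CATEGORICAL, "normal")  # rejected
except ValueError as e:
    print(e)
```

A flat lookup table is the simplest thing that could work; it says nothing yet about why a pairing is allowed, which is the part that remains fuzzy.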

The best candidate summary I have come up with so far for the definition of 'statistical type' is 'the topology of the support of a random variable'. The 'topology' part lets us meaningfully distinguish numerical from cyclic, for example, both of which are supported on the entire real line, but with entirely different topologies. We also want to distinguish, e.g., ordered counts from unordered names.
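To make the numerical/cyclic distinction concrete, here is a small sketch in which the same pair of real values is far apart under one reading and close under the other. The function names and the choice of degrees as the period are assumptions made for illustration.

```python
def numerical_distance(x, y):
    """Distance on the real line."""
    return abs(x - y)


def cyclic_distance(x, y, period):
    """Distance on a circle of the given circumference."""
    d = abs(x - y) % period
    return min(d, period - d)


# Two compass headings, in degrees (assumed period of 360).
print(numerical_distance(359.0, 1.0))      # 358.0
print(cyclic_distance(359.0, 1.0, 360.0))  # 2.0
```

The values are identical in both calls; only the notion of which values are near each other changes, which is the sense in which the type is about topology rather than about the set of representable values.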

The statistical type determines what operations are meaningful on the values that might appear for it, such as 'computing the logarithm' (not meaningful if values may be negative) or 'drawing a histogram sorted by height' (probably not what you want if values have an intrinsic ordering) or 'drawing a scatter plot' (not meaningful for pairs of variables that are unordered names).
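The three examples above can be sketched as predicates over a few qualitative traits. The `Traits` record and its field names are hypothetical, chosen only to mirror the wording of the examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traits:
    """Hypothetical qualitative traits of a statistical type."""
    numeric: bool
    ordered: bool
    may_be_negative: bool


NUMERICAL = Traits(numeric=True, ordered=True, may_be_negative=True)
COUNT = Traits(numeric=True, ordered=True, may_be_negative=False)
NAME = Traits(numeric=False, ordered=False, may_be_negative=False)


def log_is_meaningful(t):
    # Criterion from the text: not meaningful if values may be negative.
    # Zero would need separate handling; this sketch ignores it.
    return t.numeric and not t.may_be_negative


def sort_histogram_by_height(t):
    # Sorting bars by height discards any intrinsic ordering of the values.
    return not t.ordered


def scatter_plot_is_meaningful(tx, ty):
    # Placing points on axes needs an ordering for both variables.
    return tx.ordered and ty.ordered


print(log_is_meaningful(NUMERICAL))             # False
print(sort_histogram_by_height(NAME))           # True
print(scatter_plot_is_meaningful(NAME, NAME))   # False
```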

vkmvkmvkmvkm commented 8 years ago

Feras is (I believe) on this as part of his paper revisions, with the notion of "statistical type" coinciding with "canonical base measure that must dominate the measure associated with any distribution chosen during modeling".
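For reference, the standard meaning of "dominate" here is absolute continuity, which can be written as follows. The worked case after it assumes, for illustration only, that category labels are embedded in the real line as a countable set S with counting measure.

```latex
% P is dominated by the base measure \mu:
P \ll \mu
\quad\Longleftrightarrow\quad
\bigl(\mu(A) = 0 \implies P(A) = 0 \ \text{for every measurable } A\bigr)

% Assumed example: counting measure on a countable label set S \subset \mathbb{R}.
\mu_S(A) = \lvert A \cap S \rvert,
\qquad
\mu_S(\mathbb{R} \setminus S) = 0
\quad\text{but}\quad
\mathcal{N}(m, \sigma^2)(\mathbb{R} \setminus S) = 1,
\qquad\text{so}\quad
\mathcal{N}(m, \sigma^2) \not\ll \mu_S .
```

Under this reading, a normal distribution fails the domination test for a categorical variable, which matches the rejection example in the issue description.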

(Sent by email in reply to riastradh-probcomp's comment of Thu, Jul 14, 2016, 7:17 PM.)

fsaad commented 8 years ago

Yes. In addition to the base measure, there are also the qualitative invariants that define the statistical type more completely. For instance, a count variable is very different from an open-set categorical: both take values in N (the natural numbers), but the former is numeric and ordered while the latter is symbolic and unordered. That information needs to be encoded in addition to the support/base measure.
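A minimal sketch of that encoding, assuming a statistical type is recorded as a support and base measure plus two qualitative flags. The `StatTypeSpec` record, its field names, and the string labels are hypothetical, not taken from bayeslite or from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatTypeSpec:
    """Hypothetical record: base measure plus qualitative invariants."""
    support: str        # e.g. "N" for the natural numbers
    base_measure: str   # e.g. "counting" or "lebesgue"
    numeric: bool
    ordered: bool


COUNT = StatTypeSpec("N", "counting", numeric=True, ordered=True)
OPEN_SET_CATEGORICAL = StatTypeSpec("N", "counting",
                                    numeric=False, ordered=False)


def same_base(a, b):
    """True if two types share both support and base measure."""
    return (a.support, a.base_measure) == (b.support, b.base_measure)


print(same_base(COUNT, OPEN_SET_CATEGORICAL))  # True
print(COUNT == OPEN_SET_CATEGORICAL)           # False
```

The two types are indistinguishable by support and base measure alone, and only the extra flags separate them.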