Open marcoct opened 8 years ago
Hm. The existing metamodel interface exposes two functions, which correspond to the two things you didn't want: `column_value_probability` assumes a fresh row, and `row_column_predictive_probability` evaluates the probability of the value currently there, not a hypothetical one. I am in the process of refactoring that interface to permit a generic `logpdf_joint` method that would enable your query, but some change in the surface language would still be required to enable you to ask that question in a query.
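To make the distinction concrete, here is a minimal sketch of what a joint log-density query with constraints could compute. It is not the metamodel's actual `logpdf_joint`; the function `toy_logpdf_joint`, the two-column joint table, and its probabilities are all hypothetical and exist only for illustration.

```python
import math

# Hypothetical toy joint distribution over two categorical columns.
# The numbers are made up; they are not taken from the satellites data.
COLUMNS = ("purpose", "orbit")
JOINT = {
    ("Reconnaissance", "LEO"): 0.15,
    ("Reconnaissance", "GEO"): 0.05,
    ("Communications", "LEO"): 0.20,
    ("Communications", "GEO"): 0.60,
}


def _mass(assignment):
    """Total probability of all joint outcomes consistent with `assignment`."""
    return sum(
        p
        for outcome, p in JOINT.items()
        if all(outcome[COLUMNS.index(c)] == v for c, v in assignment.items())
    )


def toy_logpdf_joint(targets, constraints=None):
    """Return log P(targets | constraints) under the toy joint.

    Assumes `targets` and `constraints` name disjoint columns.
    """
    constraints = constraints or {}
    joint_mass = _mass({**constraints, **targets})
    return math.log(joint_mass) - math.log(_mass(constraints))


# Fresh row: nothing else is known, so this is a single constant.
marginal = math.exp(toy_logpdf_joint({"purpose": "Reconnaissance"}))

# Hypothetical value for a cell, given the other values observed in that row.
conditional = math.exp(
    toy_logpdf_joint({"purpose": "Reconnaissance"}, {"orbit": "LEO"})
)
```

In this sketch the unconstrained query plays the role of the fresh-row probability, while the constrained query is the per-row quantity the issue asks for: it depends on the other values in the row, and the target value need not be the one currently stored there.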
I think having access to the full capability of your `logpdf_joint` in BQL would feel empowering for me as a user: it would make me feel like I have access to the interface I (personally) am drawn to, which is the ability to sample from and evaluate probabilities under arbitrary conditional and marginal distributions derived from the full joint over the table and hypothetical rows. However, I haven't explored trying to compose these outputs into more complex objects using the tabular format.
For example, in the satellites demo, one problem I was hoping to tackle was creating a confusion matrix for a categorical column (Purpose) in which I sought to predict the Purpose from the other columns.
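As a rough illustration of the target object, a confusion matrix can be tallied from pairs of actual and predicted labels once per-row predictions are available. The helper `confusion_matrix` and the label lists below are hypothetical; how the predictions would be obtained from BQL is exactly the open question in this thread.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs.

    Returns the sorted label list and a matrix whose rows are actual
    labels and whose columns are predicted labels.
    """
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Hypothetical Purpose labels, made up for illustration.
actual = ["Communications", "Reconnaissance", "Communications", "Navigation"]
predicted = ["Communications", "Communications", "Communications", "Navigation"]

labels, matrix = confusion_matrix(actual, predicted)
for label, row in zip(labels, matrix):
    print(f"{label:>15}: {row}")
```

Off-diagonal entries are the interesting ones here: a row labeled Reconnaissance but predicted as Communications shows up in the Reconnaissance row under the Communications column.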
@axch, what's the status of that refactor? Has this been enabled by the committed foreign predictor interface? If so, what would be the associated language change?
Also, how is this related to #263? Would that want to use the same hypothetical?
Not by the foreign predictor interface per se, but by changes to the metamodel. The desired metamodel changes are complete; the only thing standing in the way of this now is designing and implementing the surface syntax for it. If you weren't about to disapparate, I would suggest that implementing it could make a nice intro-to-the-guts project for you, @gregory-marton. Not sure how to make the decision on what the syntax should actually be, though.
In looking through the satellites demo, I would like to find satellites that are likely to be reconnaissance satellites that weren't labeled as such. My first attempt at this was:
However, `probability of purpose = "Reconnaissance"` is a constant, because it estimates the probability under the distribution over a hypothetical next row. A version of this that is indexed by observed row, and that estimates the probability under the conditional distribution of the cell given all of the other observed data in the row, would be helpful. Presumably this would have the form `predictive probability of <column> = <value>`, to keep in form with `predictive probability of <column>`.
Interestingly, `predictive probability of <column> = <value>` gives the same constant result as `probability of <column> = <value>`. Is this a bug?
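The difference between the constant and the row-indexed quantity can be sketched with plain empirical frequencies standing in for the model. Everything below is an assumption made for illustration: the toy table, the single conditioning column, and the helper names are hypothetical, and a real metamodel would condition on all other observed cells rather than count matches.

```python
# Hypothetical toy table of (purpose, orbit) rows; values are made up.
ROWS = [
    ("Reconnaissance", "LEO"),
    ("Reconnaissance", "LEO"),
    ("Communications", "LEO"),
    ("Communications", "GEO"),
    ("Communications", "GEO"),
    ("Navigation", "MEO"),
]


def marginal_probability(rows, value):
    """Fraction of rows whose purpose equals `value` (same for every row)."""
    return sum(1 for purpose, _ in rows if purpose == value) / len(rows)


def conditional_probability(rows, value, orbit):
    """Fraction of rows with the given orbit whose purpose equals `value`."""
    matching = [purpose for purpose, o in rows if o == orbit]
    return sum(1 for purpose in matching if purpose == value) / len(matching)


target = "Reconnaissance"

# Analogue of the constant: it ignores the row entirely.
constant = marginal_probability(ROWS, target)

# Analogue of the row-indexed version: one value per observed row.
per_row = [conditional_probability(ROWS, target, orbit) for _, orbit in ROWS]

# Rows not labeled as the target, ranked by how likely the target is for them.
candidates = sorted(
    ((p, row) for p, row in zip(per_row, ROWS) if row[0] != target),
    reverse=True,
)
```

Ranking by `constant` cannot separate rows, since every row gets the same score. Ranking by `per_row` puts the LEO satellite labeled Communications first, which is the kind of result the original query was after.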