probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0

Add Emissions model base class/subclasses #52

Closed emilyfertig closed 2 months ago

ThomasColthurst commented 3 months ago

Are there any other Emission models and/or base class methods that you think we need?

Otherwise, I will close this issue.

emilyfertig commented 3 months ago

We probably want a way to add noise to categorical data. A simple option would be to sample from another categorical distribution with the same number of categories (probably combined with Sometimes), either with an inferred alpha or, to start, just uniformly.
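To make the uniform variant concrete, here is a minimal Python sketch. It is not code from this repository: the function names, the `p_noise` parameter, and the way the noise probability is folded in directly (rather than through Sometimes) are all assumptions made for illustration.

```python
import math
import random


def sample_noisy_category(clean, k, p_noise, rng):
    """Hypothetical helper: with probability p_noise, replace the clean
    label with a uniform draw from 0..k-1; otherwise keep it."""
    if rng.random() < p_noise:
        return rng.randrange(k)
    return clean


def noisy_category_logp(clean, dirty, k, p_noise):
    """log P(dirty | clean), summing over whether noise was applied."""
    # The noise branch can land on any category, including the clean one.
    p = p_noise / k
    if dirty == clean:
        p += 1.0 - p_noise
    return math.log(p) if p > 0.0 else -math.inf
```

Note that the uniform draw can return the clean value itself, so the probability of observing the clean label is (1 - p_noise) + p_noise / k rather than just 1 - p_noise.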

That would model mislabeled values, but looking at the top of page 4 in the PClean paper (https://arxiv.org/pdf/2007.11838), there's also a notion of "Noisy Categorical" that models typos in the labels. Our code currently assumes categoricals have a known number of categories (encoded as integers 0 through k-1), and the presence of label typos means we don't know the number of categories up front, so I think we'd need a CRP instead. That would require additional code changes, like making CRP a Distribution subclass and associating the tables with strings (or we could treat categoricals as a separate Class in the PClean sense and re-use that machinery -- I'm not sure if there's a reason not to do that).
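As a rough illustration of "associating the tables with strings", here is a small Python sketch of a CRP whose tables are keyed by label strings. The class name `StringCRP` and its methods are hypothetical and do not reflect the existing CRP code; the sketch covers only the standard CRP predictive probabilities, not a typo model that maps a table's string to a noisy observed string.

```python
from collections import Counter


class StringCRP:
    """Hypothetical CRP whose tables are identified by distinct strings."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha        # concentration parameter
        self.counts = Counter()   # table label -> number of customers
        self.n = 0                # total customers seated so far

    def incorporate(self, label):
        """Seat one customer at the table for `label`, creating it if new."""
        self.counts[label] += 1
        self.n += 1

    def prob_existing(self, label):
        """Predictive probability that the next draw joins `label`'s table."""
        return self.counts.get(label, 0) / (self.n + self.alpha)

    def prob_new(self):
        """Predictive probability that the next draw opens a new table."""
        return self.alpha / (self.n + self.alpha)
```

With this layout the number of categories is never fixed up front: an unseen string gets zero mass from `prob_existing` and is covered by `prob_new`.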

Mis-categorizations and typos both strike me as things we want in the model eventually, and fairly high priority (although label typos especially seem like a significant amount of work, so if you're focusing on other things you could close this and open other issue(s) for noisy categoricals that others could get to as we have time).

Curious as to @alex-lew's and @Joaoloula's thoughts on this.

ThomasColthurst commented 3 months ago

That all sounds fun enough that I'll selfishly keep this task assigned to myself. :)

One interesting complication is that Sometimes requires the Emission it is adapting to assign zero probability to the true value, which wouldn't be straightforwardly true for either another categorical or CRP distribution.
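One way to see the constraint is a Python sketch under an assumed two-branch reading of Sometimes: keep the clean value with probability 1 - p_noise, otherwise emit from the wrapped emission. The function names and signatures below are hypothetical and are not the repository's API. A base emission that is uniform over the k - 1 other categories puts zero mass on the true value, so the two branches never overlap.

```python
import math


def other_category_logp(clean, dirty, k):
    """Hypothetical base emission: uniform over the k-1 categories other
    than `clean`, so the true value gets zero probability."""
    if dirty == clean or k < 2:
        return -math.inf
    return -math.log(k - 1)


def sometimes_logp(clean, dirty, p_noise, base_logp):
    """Assumed form of the wrapper, for 0 < p_noise < 1: an unchanged value
    is attributed entirely to the no-noise branch."""
    if dirty == clean:
        return math.log(1.0 - p_noise)
    return math.log(p_noise) + base_logp(clean, dirty)


# Example: k = 3 categories, 30% chance of noise.
base = lambda clean, dirty: other_category_logp(clean, dirty, 3)
probs = [math.exp(sometimes_logp(0, d, 0.3, base)) for d in range(3)]
```

If the base emission were a plain categorical or a CRP, it could also produce the clean value, and the `dirty == clean` case above would undercount the likelihood; excluding the true value and renormalizing, as in `other_category_logp`, is one possible workaround.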