probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0
4 stars 2 forks source link

latent_values_incremental_logp can call base_relation->incorporate with latent values that are out of distribution's set of valid values #216

Closed ThomasColthurst closed 1 month ago

ThomasColthurst commented 2 months ago

Lines 110-12 of transition_latent_value.hh, inside latent_values_incremental_logp are:

for (const ValueType& v : latent_values) { // Incorporate the latent/noisy values for this candidate. base_relation->incorporate_to_cluster(base_items, v);

When the distribution of the base_relation is something like StringCat, this can cause a crash when the latent_value is outside of the distribution's range of valid values. (For example, not on the StringCat's list of valid strings).

The latent values are computed on lines 51-57 of transition_latent_value.hh: std::vector latent_value_candidates; for (auto [name, rel] : noisy_relations) { for (const auto& [i, em] : rel->get_emission_clusters()) { latent_value_candidates.push_back( em->propose_clean(all_noisy_observations, prng)); } }

We need to augment this procedure with something that takes information from the distribution of the base_relation. One possibility would be to add a "nearest" method to distributions/base.hh::Distribution that returns the nearest valid value given a possibly invalid value. This method can default to the identity function, and we can override it for the distributions that need it.

(I believe the distributions that would need it are: DistributionAdapter DistributionAdapter DirichletCategorical StringCat StringNat )

Assigining to Emily to approve this design or suggest another. Assigning to Thomas to implement.

emilyfertig commented 1 month ago

I think this design sounds good. In the future it might be nice to return top k nearest or a sample from something around nearest, but I think what you proposed is a good initial solution. (For distributions that take a finite number of values, it would be straightforward to take the logp of the noisy observation given each valid distribution value, and sample from that or return the top k). I'll think more about other approaches that we (or someone else) could consider longer term.

ThomasColthurst commented 1 month ago

Fixed in https://github.com/probcomp/hierarchical-irm/pull/224