Closed ThomasColthurst closed 1 month ago
I think this design sounds good. In the future it might be nice to return top k nearest or a sample from something around nearest, but I think what you proposed is a good initial solution. (For distributions that take a finite number of values, it would be straightforward to take the logp of the noisy observation given each valid distribution value, and sample from that or return the top k). I'll think more about other approaches that we (or someone else) could consider longer term.
Lines 110-12 of transition_latent_value.hh, inside latent_values_incremental_logp are:
for (const ValueType& v : latent_values) { // Incorporate the latent/noisy values for this candidate. base_relation->incorporate_to_cluster(base_items, v);
When the distribution of the base_relation is something like StringCat, this can cause a crash when the latent_value is outside of the distribution's range of valid values. (For example, not on the StringCat's list of valid strings).
The latent values are computed on lines 51-57 of transition_latent_value.hh: std::vector latent_value_candidates;
for (auto [name, rel] : noisy_relations) {
for (const auto& [i, em] : rel->get_emission_clusters()) {
latent_value_candidates.push_back(
em->propose_clean(all_noisy_observations, prng));
}
}
We need to augment this procedure with something that takes information from the distribution of the base_relation. One possibility would be to add a "nearest" method to distributions/base.hh::Distribution that returns the nearest valid value given a possibly invalid value. This method can default to the identity function, and we can override it for the distributions that need it.
(I believe the distributions that would need it are: DistributionAdapter
DistributionAdapter
DirichletCategorical
StringCat
StringNat
)
Assigining to Emily to approve this design or suggest another. Assigning to Thomas to implement.