Closed alex-lew closed 2 weeks ago
I can take this one (feel free to assign it to me, or maybe there's a setting to change so we can assign ourselves, depending on what you want to do with permissions)
@emilyfertig Just assigned you! And I think @Joaoloula just changed permissions so that you can assign yourself in the future. Thanks!!
I pushed some work on this here: https://github.com/probcomp/hierarchical-irm/tree/051624-emilyaf-generic-dist-type
It doesn't fix the issue yet, but it makes Distribution
and Relation
templated classes (still with only double
as the template parameter so the code still runs). We'll have to think through how to modify the data structures that store observations and (pointers to) relations and distributions, since those can now have different types. We could do a minimally invasive quick-ish fix, but maybe it makes sense to rethink the design more substantially (either now in the existing code base if we think it'll translate well to the hybrid HIRM/PClean system, or later as we're thinking through the design of that system). Loom had a solution for this if I remember right, which wasn't especially nice but it also wasn't obvious to me how to improve it.
I'm off work tomorrow and will pick this up again on Monday.
@emilyfertig Thanks, this looks great! And thanks for raising these issues.
Maybe it makes sense to divide the work into a couple steps:
Step 1. Make Distribution
a templated class, but have all the rest of the code explicitly use Distribution<double>
. This will unblock #7, at least, because we will be able to implement distributions on types other than double
s (even if they cannot yet be used by the rest of the code base).
We could make Relation
templated at the same time, but I wasn't sure how this would work with code like the following, which assumes the underlying distribution is BetaBernoulli
: https://github.com/probcomp/hierarchical-irm/blob/0f2d8eaa41ab32821a111ef5021a74bd540ffa28/cxx/hirm.hh#L370-L371
Maybe we will ultimately need to make Relation
parametric in the Distribution type that models its outputs, rather than in a value type alone.
Step 2. We could then figure out a general solution to handling relations with varying output types, which is also a concern for #4. I'm curious what you all think makes the most sense here. One option, since we're in C++17, is to define a type RelationVariant
:
using RelationVariant = std::variant<Relation<int>, Relation<double>, Relation<std::string>>;
Then the relations
attribute of an IRM could be a map<string, RelationVariant>
, and we could use std::visit
to call the appropriate, e.g., unincorporate
method depending on the type inside the variant:
void unincorporate(const string &r, const T_items &items) {
std::visit([&items](auto&& rel) { rel.unincorporate(items); }, relations.at(r));
}
But I'm definitely open to any design you think makes sense. I haven't worked much with the Loom codebase -- curious to hear how they handled it, if you think a similar design could work for us. (Tagging @ThomasColthurst in case this is also relevant to #4.)
We could do a minimally invasive quick-ish fix, but maybe it makes sense to rethink the design more substantially (either now in the existing code base if we think it'll translate well to the hybrid HIRM/PClean system, or later as we're thinking through the design of that system.
I think it's worth designing something that we are at least somewhat happy with. I suspect we'll be able to get a lot of the way to the hybrid model (at least an initial implementation) via incremental changes to this codebase.
I sent a PR to make Distribution a templated class, so we’re unblocked for the bigram string distribution. I held off on making changes to Relation before I spend some more time understanding the code -- thanks for outlining some of the possibilities.
I've uploaded my in-progress branch 052024-thomaswc-schema so you can see how I'm planning on parsing and storing the distribution name. It's mostly done; I'm just tracking down a segfault.
Currently, the
Distribution
base class assumes that observations are of typedouble
. We will want to support other types of observation (most notably string-valued observations). To lay the groundwork for this, we will need to makeDistribution
generic over its output type.