Make `Distribution` generic in its output type

alex-lew commented 1 month ago

Currently, the Distribution base class assumes that observations are of type double. We will want to support other types of observation (most notably string-valued observations). To lay the groundwork for this, we will need to make Distribution generic over its output type.

emilyfertig commented 1 month ago

I can take this one (feel free to assign it to me, or maybe there's a setting to change so we can assign ourselves, depending on what you want to do with permissions)

alex-lew commented 1 month ago

@emilyfertig Just assigned you! And I think @Joaoloula just changed permissions so that you can assign yourself in the future. Thanks!!

emilyfertig commented 1 month ago

I pushed some work on this here: https://github.com/probcomp/hierarchical-irm/tree/051624-emilyaf-generic-dist-type

It doesn't fix the issue yet, but it makes Distribution and Relation templated classes (still with only double as the template parameter so the code still runs). We'll have to think through how to modify the data structures that store observations and (pointers to) relations and distributions, since those can now have different types. We could do a minimally invasive quick-ish fix, but maybe it makes sense to rethink the design more substantially (either now in the existing code base if we think it'll translate well to the hybrid HIRM/PClean system, or later as we're thinking through the design of that system). Loom had a solution for this if I remember right, which wasn't especially nice but it also wasn't obvious to me how to improve it.

I'm off work tomorrow and will pick this up again on Monday.

alex-lew commented 1 month ago

@emilyfertig Thanks, this looks great! And thanks for raising these issues.

Maybe it makes sense to divide the work into a couple steps:

Step 1. Make Distribution a templated class, but have all the rest of the code explicitly use Distribution<double>. This will unblock #7, at least, because we will be able to implement distributions on types other than doubles (even if they cannot yet be used by the rest of the code base).

We could make Relation templated at the same time, but I wasn't sure how this would work with code like the following, which assumes the underlying distribution is BetaBernoulli: https://github.com/probcomp/hierarchical-irm/blob/0f2d8eaa41ab32821a111ef5021a74bd540ffa28/cxx/hirm.hh#L370-L371

Maybe we will ultimately need to make Relation parametric in the Distribution type that models its outputs, rather than in a value type alone.

Step 2. We could then figure out a general solution to handling relations with varying output types, which is also a concern for #4. I'm curious what you all think makes the most sense here. One option, since we're in C++17, is to define a type RelationVariant:

using RelationVariant = std::variant<Relation<int>, Relation<double>, Relation<std::string>>;

Then the relations attribute of an IRM could be a map<string, RelationVariant>, and we could use std::visit to call the appropriate, e.g., unincorporate method depending on the type inside the variant:

void unincorporate(const string &r, const T_items &items) {
        std::visit([&items](auto&& rel) { rel.unincorporate(items); }, relations.at(r));
}

But I'm definitely open to any design you think makes sense. I haven't worked much with the Loom codebase -- curious to hear how they handled it, if you think a similar design could work for us. (Tagging @ThomasColthurst in case this is also relevant to #4.)

We could do a minimally invasive quick-ish fix, but maybe it makes sense to rethink the design more substantially (either now in the existing code base if we think it'll translate well to the hybrid HIRM/PClean system, or later as we're thinking through the design of that system.

I think it's worth designing something that we are at least somewhat happy with. I suspect we'll be able to get a lot of the way to the hybrid model (at least an initial implementation) via incremental changes to this codebase.

emilyfertig commented 1 month ago

I sent a PR to make Distribution a templated class, so we’re unblocked for the bigram string distribution. I held off on making changes to Relation before I spend some more time understanding the code -- thanks for outlining some of the possibilities.

ThomasColthurst commented 1 month ago

I've uploaded my in-progress branch 052024-thomaswc-schema so you can see how I'm planning on parsing and storing the distribution name. It's mostly done; I'm just tracking down a segfault.

probcomp / hierarchical-irm

Make `Distribution` generic in its output type #3