probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0
4 stars 2 forks source link

Fix `make_hirm_schema` so that input classes of noisy relations contain only destinations of DAG paths (and base relation input classes) #188

Closed emilyfertig closed 1 month ago

emilyfertig commented 2 months ago

For example, in the following block from schema_helper_test, test_make_hirm_schema

  BOOST_TEST(tschema.contains("Specialty"));
  T_noisy_relation nr1 = std::get<T_noisy_relation>(tschema["Specialty"]);
  ...
  expected_domains = {"School", "Physician", "City", "Practice", "Record"};
  BOOST_TEST(nr1.domains == expected_domains, tt::per_element());

expected_domains should be {"School", "Physician", "Record"} since the DAG is School -> Physician -> Record <- Practice <- City (and School, Physician are the input classes of the base relation, and Record is the only class that's the destination of a path starting at Physician)

ThomasColthurst commented 1 month ago

Adding Alex to answer the question of whether we want to make the old behavior optional with a flag.

I ask because it seems like currently the GenDB doc treats clean and noisy relations a bit asymmetrically in terms of what domains they can condition on -- the clean relations get the maximal set of available domains, but the noisy relations get the minimal set. I was curious as to whether this asymmetry was intentional, and if so whether it was motivated by (a) computational concerns [i.e., there are typically a lot more noisy relations than clean relations], (b) world modelling concerns [i.e., it makes sense that Physician names might vary from state to state, but the typo function applied to the name of the Physician's med school probably doesn't vary much from state to state], (c) compatibility with how PClean did things, or (d) other.