probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0

HIRM assigns logp_score = -inf to hospital data #167

Closed ThomasColthurst closed 1 month ago

ThomasColthurst commented 1 month ago

This can be confirmed by running the integration tests at head:

ThomasColthurst commented 1 month ago

At least in iteration #1, the NaNs come from a subset of the relations: Condition, MeasureName, HospitalName, CountyName, Address1, PhoneNumber, HospitalType, and City. That is essentially all of the "~ string" variables.

This points to a problem with the bigram_string distribution. The other part of a "~ string" variable is the bigram string emission, and that is shared with several other variable types, like typo_int or stringcat, which are not producing NaNs in this iteration.
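
For anyone wanting to repeat the per-relation check described above, here is a minimal sketch of the idea. It assumes the per-relation log scores have already been collected into a plain mapping from relation name to score. The helper `non_finite_relations` and the numeric values are hypothetical, not part of the hierarchical-irm code.

```python
import math


def non_finite_relations(logp_by_relation):
    """Return the sorted names of relations whose log score is NaN or +/-inf."""
    return sorted(
        name
        for name, logp in logp_by_relation.items()
        if not math.isfinite(logp)
    )


# Made-up scores for illustration only; the relation names come from the
# hospital schema mentioned in this thread, the numbers do not.
example_scores = {
    "Condition": float("nan"),
    "City": float("nan"),
    "Hospital:zip": -1234.5,
    "Hospital:provider": -87.25,
}

bad = non_finite_relations(example_scores)
print(bad)
```

Grouping the flagged names by their declared variable type then shows whether the failures line up with a single distribution.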

ThomasColthurst commented 1 month ago

More confirming information from running on the flights data: once again in the first iteration, all of the "~ string" variables, and only those, get -nan values: flight, act_arr_time, act_dep_time, sched_dep_time, and sched_arr_time.

ThomasColthurst commented 1 month ago

Adding max_length to the bigram distribution solved most of this. Now, only two typo_int fields are causing problems in the hospital data: Hospital:zip and Hospital:provider.
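
To illustrate what a length cap on a bigram string model can look like, here is a toy sketch. It is not the repository's bigram distribution and does not claim to reproduce the bug or its fix. The class `ToyBigram`, its add-alpha smoothing, and the way `max_length` is applied (the string must end once it reaches the cap, and longer strings get log probability -inf) are all assumptions made for this example.

```python
import math
from collections import defaultdict

START = "^"  # hypothetical start-of-string context
END = "$"    # hypothetical end-of-string symbol


class ToyBigram:
    """Toy character bigram model with add-alpha smoothing and a length cap."""

    def __init__(self, alphabet, max_length, alpha=1.0):
        self.symbols = list(alphabet) + [END]
        self.max_length = max_length
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s):
        """Add the transitions of s, including the final END, to the counts."""
        prev = START
        for c in list(s) + [END]:
            self.counts[prev][c] += 1
            prev = c

    def _log_prob_next(self, prev, c):
        # Smoothed conditional probability of c given the previous symbol.
        total = sum(self.counts[prev].values()) + self.alpha * len(self.symbols)
        return math.log((self.counts[prev][c] + self.alpha) / total)

    def logp(self, s):
        # Strings over the cap are outside the support of the model.
        if len(s) > self.max_length:
            return -math.inf
        prev, total = START, 0.0
        for c in s:
            total += self._log_prob_next(prev, c)
            prev = c
        # Below the cap the string must choose END; at the cap END is forced
        # (probability 1), so it adds nothing to the log probability.
        if len(s) < self.max_length:
            total += self._log_prob_next(prev, END)
        return total


model = ToyBigram(alphabet="ab", max_length=3)
model.observe("ab")
print(model.logp("ab"), model.logp("abab"))
```

With the cap in place, the toy model has finite support and every string within the cap gets a finite log probability under smoothing.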