probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0

pclean.cc integration tests are failing #233

Open emilyfertig opened 2 weeks ago

emilyfertig commented 2 weeks ago

The assertions in CleanRelation::logp_gibbs_exact* are failing, apparently due to roundoff error. When I comment out the assertions, it still crashes with "Warning: all Dirichlet hyperparameters give nans!"

Below is the log when I run

./bazel-bin/pclean/pclean --schema=assets/flights.schema --obs=assets/flights_dirty.10.csv --iters=5 --output=/tmp/flights.out

on the branch 100424-emilyaf-bigram-debug.

Setting seed to 10
Reading plcean schema ...
Reading schema file from assets/flights.schema
Making GenDB model ...
Reading observations ...
Reading observations file from assets/flights_dirty.10.csv
Incorporating observations ...
Schema does not contain tuple_id, skipping ...
Running inference ...
Starting outer iteration 1, model score = -5582309.321400
Starting iteration 1, model score = -5582187.998050
calling logp gibbs exact on act_arr_time
true is 9:32 a.m. noisy is 9:32 a.m.
calling logp gibbs exact on act_arr_time
true is 4:09 p.m. noisy is 4:09 p.m.
calling logp gibbs exact on act_arr_time
true is 9:28 a.m. noisy is 9:28 a.m.
calling logp gibbs exact on act_arr_time
true is 2:50 p.m. noisy is 2:50 p.m.
in relation act_arr_time_emission
oh no should be less than 2.24782e-11 but is 1
logp0 is -101233, logp score is -101234
pclean: clean_relation.hh:345: double CleanRelation<T>::logp_gibbs_exact_current(const std::vector<std::vector<int> >&) [with T = std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >]: Assertion `false' failed.
Aborted

emilyfertig commented 2 weeks ago

Maybe it isn't roundoff error. When I change the max length of the Time string from 40 to 30, the logp discrepancy is larger:

Setting seed to 10
Reading plcean schema ...
Reading schema file from assets/flights.schema
Making GenDB model ...
Reading observations ...
Reading observations file from assets/flights_dirty.10.csv
Incorporating observations ...
Schema does not contain tuple_id, skipping ...
Running inference ...
Starting outer iteration 1, model score = -5781017.216540
Starting iteration 1, model score = -5780895.893191
calling logp gibbs exact on act_arr_time
true is 9:32 a.m. noisy is 9:32 a.m.
in relation sched_arr_time_emission
oh no should be less than 2.24782e-11 but is 18
logp0 is -101233, logp score is -101215
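
For scale, the threshold printed in both logs (2.24782e-11) is consistent with machine epsilon times |logp0|, which would make it a purely relative roundoff bound. That reading is inferred from the numbers and is not confirmed against clean_relation.hh. Under that assumption, a minimal sketch (the name `within_roundoff` is hypothetical) shows how far differences of 1 and 18 are from anything roundoff could explain:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper, NOT the actual check in clean_relation.hh.
// Treats two log-probabilities as equal when they differ by at most
// machine epsilon times the magnitude of the reference value.
bool within_roundoff(double reference, double value) {
  const double eps = std::numeric_limits<double>::epsilon();
  return std::fabs(reference - value) <= eps * std::fabs(reference);
}
```

With the logged values, `within_roundoff(-101233.0, -101234.0)` and `within_roundoff(-101233.0, -101215.0)` both return false: the observed gaps are roughly eleven to twelve orders of magnitude above the bound.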

emilyfertig commented 2 weeks ago

I commented out the failing logp checks in CleanRelation to try to get some insight into why the Dirichlet hparams were NaN, and it appears there are NaNs in the counts vector of the bigram insertions distribution.

Transitioning bigram
Transitioning bigram insertions
counts in dirichlet cat is 
-nan 0 0 -nan -nan 0 0 -nan 0 0 0 0 0 0 0 0 0 0 0 0 0 -nan 0 0 0 0 0 -nan 0 0 -nan 0 0 -nan 0 0 0 -nan 0 0 0 0 0 -nan 0 0 0 0 0 0 -nan 0 0 0 0 0 0 0 -nan 0 0 0 0 -nan 0 0 0 0 -nan 0 0 0 -nan 0 0 0 -nan 0 0 0 0 -nan 0 0 0 0 0 0 0 0 0 0 0 0 0 -nan 
Warning: all Dirichlet hyperparameters give nans!
pclean: distributions/dirichlet_categorical.cc:62: virtual void DirichletCategorical::transition_hyperparameters(std::mt19937*): Assertion `false' failed.
Aborted
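
To narrow down where the NaNs first enter, one option is to scan the counts after each update and report the offending positions. A minimal sketch, where `nan_indices` is a hypothetical debugging helper and not part of DirichletCategorical:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical debugging helper: returns the positions of all NaN
// entries in a counts vector, in increasing order.
std::vector<std::size_t> nan_indices(const std::vector<double>& counts) {
  std::vector<std::size_t> bad;
  for (std::size_t i = 0; i < counts.size(); ++i) {
    if (std::isnan(counts[i])) bad.push_back(i);
  }
  return bad;
}
```

Calling this right after each count update (rather than at hyperparameter transition time) would point at the first update that writes a NaN.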

emilyfertig commented 2 weeks ago

This looks weird to me: https://github.com/probcomp/hierarchical-irm/blob/master/cxx/emissions/bigram_string.cc#L138

  double total_prob = 0.0;
  for (auto& a : alignments) {
    a.cost = exp(a.cost);  // Turn all costs into non-log probabilities
    total_prob += a.cost;
  }

  for (const auto& a : alignments) {
    double w = weight * a.cost / total_prob;

We're adding up exp(a.cost) to get total_prob, but then we appear to be weighting a.cost without the exp (and dividing by total_prob). I suspect it should be exp(a.cost) on the last line of the snippet too, but when I make that change just to try it, the code hangs.
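
One thing worth checking in the snippet above: the first loop iterates with `auto&`, so `a.cost = exp(a.cost)` overwrites the stored value, and the second loop already sees the exponentiated cost. A separate possible source of the NaNs (an assumption, not verified against bigram_string.cc) is underflow: if every log cost is very negative, each `exp` returns 0, `total_prob` is 0, and `0.0 / 0.0` is NaN. The sketch below shows the standard log-sum-exp style workaround on a plain vector of log costs. `normalized_weights` is a hypothetical name, and the vector stands in for the alignments.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the alignment weighting, not the code in
// bigram_string.cc. Converts log costs into weights that sum to `weight`.
// The maximum log cost is subtracted before exponentiating, so the
// largest term is exp(0) = 1 and the total cannot underflow to zero.
// Assumes at least one log cost is finite.
std::vector<double> normalized_weights(const std::vector<double>& log_costs,
                                       double weight) {
  std::vector<double> w(log_costs.size());
  if (log_costs.empty()) return w;
  const double max_log =
      *std::max_element(log_costs.begin(), log_costs.end());
  double total = 0.0;
  for (std::size_t i = 0; i < log_costs.size(); ++i) {
    w[i] = std::exp(log_costs[i] - max_log);
    total += w[i];
  }
  for (double& x : w) x = weight * x / total;
  return w;
}
```

For example, log costs {-2000, -2000} with weight 1 give {0.5, 0.5} here, whereas exponentiating first and then dividing gives NaN for both entries because `exp(-2000)` is 0 in double precision.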