pclean.cc integration tests are failing

emilyfertig commented 1 month ago

The assertions in CleanRelation::logp_gibbs_exact* are failing, apparently due to roundoff error. When I comment out the assertions, it still crashes with "Warning: all Dirichlet hyperparameters give nans!"

Below is the log when I run

./bazel-bin/pclean/pclean --schema=assets/flights.schema --obs=assets/flights_dirty.10.csv --iters=5 --output=/tmp/flights.out

on the branch 100424-emilyaf-bigram-debug.

Setting seed to 10
Reading plcean schema ...
Reading schema file from assets/flights.schema
Making GenDB model ...
Reading observations ...
Reading observations file from assets/flights_dirty.10.csv
Incorporating observations ...
Schema does not contain tuple_id, skipping ...
Running inference ...
Starting outer iteration 1, model score = -5582309.321400
Starting iteration 1, model score = -5582187.998050
calling logp gibbs exact on act_arr_time
true is 9:32 a.m. noisy is 9:32 a.m.
calling logp gibbs exact on act_arr_time
true is 4:09 p.m. noisy is 4:09 p.m.
calling logp gibbs exact on act_arr_time
true is 9:28 a.m. noisy is 9:28 a.m.
calling logp gibbs exact on act_arr_time
true is 2:50 p.m. noisy is 2:50 p.m.
in relation act_arr_time_emission
oh no should be less than 2.24782e-11 but is 1
logp0 is -101233, logp score is -101234
pclean: clean_relation.hh:345: double CleanRelation<T>::logp_gibbs_exact_current(const std::vector<std::vector<int> >&) [with T = std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >]: Assertion `false' failed.
Aborted

emilyfertig commented 1 month ago

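Maybe it isn't roundoff error. When I change the max length of the Time string from 40 to 30, the logp discrepancy is higher (18 instead of 1), and the failing relation changes from act_arr_time_emission to sched_arr_time_emission: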

Setting seed to 10
Reading plcean schema ...
Reading schema file from assets/flights.schema
Making GenDB model ...
Reading observations ...
Reading observations file from assets/flights_dirty.10.csv
Incorporating observations ...
Schema does not contain tuple_id, skipping ...
Running inference ...
Starting outer iteration 1, model score = -5781017.216540
Starting iteration 1, model score = -5780895.893191
calling logp gibbs exact on act_arr_time
true is 9:32 a.m. noisy is 9:32 a.m.
in relation sched_arr_time_emission
oh no should be less than 2.24782e-11 but is 18
logp0 is -101233, logp score is -101215
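
For scale: the threshold printed in both logs, 2.24782e-11, is what you get from multiplying double-precision machine epsilon (about 2.22e-16) by |logp0| (about 101233). The sketch below assumes the check has that form. This is inferred from the printed numbers, not read from clean_relation.hh, and the function names are hypothetical.

```cpp
#include <cmath>
#include <limits>

// Assumed form of the tolerance: machine epsilon scaled by |expected|.
// (Inferred from the logged threshold; the real check may differ.)
double relative_tolerance(double expected) {
  return std::numeric_limits<double>::epsilon() * std::fabs(expected);
}

// True when the two log probabilities agree to within that tolerance.
bool logp_matches(double expected, double actual) {
  return std::fabs(expected - actual) <= relative_tolerance(expected);
}
```

With logp0 = -101233 and a score of -101234, the gap of 1 is roughly ten orders of magnitude above the threshold, and a gap of 18 is larger still. That is far more than accumulated roundoff would normally explain.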

emilyfertig commented 1 month ago

I commented out the failing logp checks in CleanRelation to try to get some insight into why the Dirichlet hparams were NaN, and it appears there are NaNs in the counts vector of the bigram insertions distribution.

Transitioning bigram
Transitioning bigram insertions
counts in dirichlet cat is 
-nan 0 0 -nan -nan 0 0 -nan 0 0 0 0 0 0 0 0 0 0 0 0 0 -nan 0 0 0 0 0 -nan 0 0 -nan 0 0 -nan 0 0 0 -nan 0 0 0 0 0 -nan 0 0 0 0 0 0 -nan 0 0 0 0 0 0 0 -nan 0 0 0 0 -nan 0 0 0 0 -nan 0 0 0 -nan 0 0 0 -nan 0 0 0 0 -nan 0 0 0 0 0 0 0 0 0 0 0 0 0 -nan 
Warning: all Dirichlet hyperparameters give nans!
pclean: distributions/dirichlet_categorical.cc:62: virtual void DirichletCategorical::transition_hyperparameters(std::mt19937*): Assertion `false' failed.
Aborted
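
To find where the NaNs first enter the counts, one option is to check the vector after every update and stop at the first bad entry. A minimal sketch, assuming the counts are available as a std::vector<double>; first_non_finite is a hypothetical helper, not something in the repository:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first NaN or infinite entry in counts,
// or -1 if every entry is finite.
std::ptrdiff_t first_non_finite(const std::vector<double>& counts) {
  for (std::size_t i = 0; i < counts.size(); ++i) {
    if (!std::isfinite(counts[i])) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  return -1;
}
```

Asserting first_non_finite(counts) == -1 right after each update would make the run abort at the first call that writes a NaN, rather than later in transition_hyperparameters.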

emilyfertig commented 1 month ago

This looks weird to me: https://github.com/probcomp/hierarchical-irm/blob/master/cxx/emissions/bigram_string.cc#L138

  double total_prob = 0.0;
  for (auto& a : alignments) {
    a.cost = exp(a.cost);  // Turn all costs into non-log probabilities
    total_prob += a.cost;
  }

  for (const auto& a : alignments) {
    double w = weight * a.cost / total_prob;

We're adding up exp(a.cost) to get total_prob, but then we're weighting by a.cost without the exp (and dividing by total_prob). I suspect it should be exp(a.cost) on the last line of the snippet too, but when I make that change just to try it, the code hangs.
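
A note on the snippet above: the first loop iterates with auto& a, so a.cost is overwritten with exp(a.cost) in place, and the second loop already reads a probability rather than a log cost. Adding another exp there would exponentiate twice. A different way this normalization could produce NaN weights is underflow: if every log cost is very negative, each exp returns 0, total_prob is 0, and 0/0 is NaN. The sketch below reproduces that with a hypothetical Alignment struct and made-up costs, and contrasts it with a log-sum-exp normalization. It is an illustration under those assumptions, not a confirmed diagnosis and not the repository's code.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for the alignment type; only `cost` matters here.
struct Alignment {
  double cost;  // log probability on entry
};

// Mirrors the normalization in the snippet: exponentiate in place, then divide.
// Takes the vector by value so the caller's log costs are left untouched.
std::vector<double> naive_weights(std::vector<Alignment> alignments,
                                  double weight) {
  double total_prob = 0.0;
  for (auto& a : alignments) {
    a.cost = std::exp(a.cost);  // a.cost now holds a probability
    total_prob += a.cost;
  }
  std::vector<double> weights;
  for (const auto& a : alignments) {
    // If every exp underflowed to 0, this is 0/0, which is NaN.
    weights.push_back(weight * a.cost / total_prob);
  }
  return weights;
}

// Log-sum-exp variant: subtract the largest log cost before exponentiating,
// so the largest term is exp(0) = 1 and the sum cannot underflow to 0.
// Assumes at least one cost is finite.
std::vector<double> stable_weights(const std::vector<Alignment>& alignments,
                                   double weight) {
  std::vector<double> weights;
  if (alignments.empty()) return weights;
  double max_cost = -std::numeric_limits<double>::infinity();
  for (const auto& a : alignments) max_cost = std::max(max_cost, a.cost);
  double total = 0.0;
  for (const auto& a : alignments) total += std::exp(a.cost - max_cost);
  for (const auto& a : alignments) {
    weights.push_back(weight * std::exp(a.cost - max_cost) / total);
  }
  return weights;
}
```

With log costs of -800 and -801, naive_weights returns NaN for both entries, while stable_weights returns finite weights that sum to the given weight. For moderate costs such as -1 and -2 the two functions agree.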

pclean.cc integration tests are failing #233