stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

If model evaluates at initial conditions without autodiff but fails with autodiff then the error messages are not great #3034

Open bbbales2 opened 3 years ago

bbbales2 commented 3 years ago

Summary:

For instance, with an ODE model @jtimonen found output like this:

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)

We would expect an error message like:

Rejecting initial value:
  Error evaluating the log probability with gradients at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)

Initialization between (-2, 2) failed after 100 attempts. 
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Initialization failed.

Here is a model that exhibits this behavior with certain seeds:

functions {
  // Lotka-Volterra system
  vector derivative_fun(real t, vector y, data int[] a0, vector theta) {
    vector[2] dydt;
    dydt[1] = theta[1]*y[1] - y[1]*y[2];
    dydt[2] = y[1]*y[2] - theta[2]*y[2];
    return dydt;
  }
}

data {
  int<lower=1> N;
  real t_eval[N]; // must be increasing
  vector[2] y_data[N];
  vector[2] y0;
  real t0;
  real<lower=0> ABS_TOL;
  real<lower=0> REL_TOL;
  int<lower=1> MAX_NUM_STEPS;
}

transformed data {
  int a0[0];
}

parameters {
  vector<lower=0>[2] theta;
  real<lower=0> sigma;
}

model {
  theta ~ normal(1, 0.3);
  sigma ~ normal(0, 2.0);
  vector[2] y_hat[N] = ode_bdf_tol(derivative_fun, y0, t0, t_eval,
    REL_TOL, ABS_TOL, MAX_NUM_STEPS, a0, theta);
  for(n in 1:N){
    target += normal_lpdf(y_data[n] | y_hat[n], sigma);
  }
}

JSON data:

{
  "N": 20,
  "t_eval": [1.542, 3.314, 3.582, 3.942, 4.618, 5.974, 6.338, 6.546, 6.698, 7.106, 7.53, 7.602, 7.758, 8.334, 8.45, 9.018, 9.17, 10.886, 11.366, 11.574],
  "y0": [1, 2],
  "t0": 0,
  "y_data": [
    [0.344797088393873, 1.6930242060351],
    [0.635418289383446, 0.0906638098070269],
    [0.519516090322978, 0.821024324067281],
    [1.07889503979525, 0.477741038303985],
    [0.773776915048449, 0.255210793923543],
    [1.72581484166438, 1.5527620189067],
    [0.851008213891165, 1.80589311943048],
    [1.62769164960979, 1.57197163973671],
    [1.06907814440105, 1.77751680937944],
    [0.559461058239121, 1.77778520957167],
    [0.319224257399646, 2.09229342823705],
    [0.656623356899921, 2.35616034285975],
    [0.266800987984575, 1.59088141802012],
    [-0.0843099951671533, 1.29952132205262],
    [0.510139049085921, 1.16569483217513],
    [0.739499156159327, 1.5954691564362],
    [0.300537914354303, 1.33024642117391],
    [0.453591698548004, 0.627159819286871],
    [1.13100675948511, 0.372307762827462],
    [1.63136411638082, 1.04814114632482]
  ],
  "REL_TOL": 1e-06,
  "ABS_TOL": 1e-06,
  "MAX_NUM_STEPS": 30
}

Current Version:

v2.26.1

jtimonen commented 3 years ago

Yea, so it seems that the current code assumes that if computing the log probability succeeds then computing the gradient will also succeed. With ODEs, however, it is possible that log_prob succeeds for a given max_num_steps, but computing the gradient requires solving an extended system, and that can fail. There could also be other models that have this sort of behaviour.
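
(To make the mechanism concrete, here is a toy C++ sketch with made-up step counts, not Stan code: the value-only solve and the solve with sensitivities share the same max_num_steps budget, but the augmented system is larger and typically needs more steps, so the budget can suffice for the value while the gradient pass fails.)

#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for an ODE solve that needs `steps_needed` internal
// steps and is given a budget of `max_num_steps`.
double solve(int steps_needed, int max_num_steps) {
  if (steps_needed > max_num_steps)
    throw std::domain_error(
        "Failed to integrate to next output time in less than max_num_steps steps");
  return 1.0;  // dummy solution value
}

int main() {
  const int max_num_steps = 30;
  solve(20, max_num_steps);    // value-only (double) pass: fits in the budget
  try {
    solve(60, max_num_steps);  // pass with sensitivities: larger system, fails
  } catch (const std::domain_error& e) {
    std::cout << "gradient pass failed: " << e.what() << "\n";
  }
  return 0;
}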

betanalpha commented 3 years ago

Yea, so it seems that the current code assumes that if computing the log probability succeeds then computing the gradient will also succeed. With ODEs, however, it is possible that log_prob succeeds for a given max_num_steps, but computing the gradient requires solving an extended system, and that can fail. There could also be other models that have this sort of behaviour.

The code does not make this assumption.

As detailed in https://github.com/stan-dev/stan/blob/develop/src/stan/services/util/initialize.hpp the initialization begins by checking the double evaluation of log_prob (line 128). The "Error evaluating the log probability at the initial value" message is passed after the double evaluation but before the gradient is considered, which means that the double evaluation alone is failing in @jtimonen's model (I'm guessing the replication of the message is due to four chains all dumping the same error). The autodiff evaluation of log_prob is then checked (line 166) in the same loop, and if that evaluation fails then the message "Gradient evaluated at the initial value is not finite." is passed.

Again, because "Error evaluating the log probability at the initial value" is passed to the interface here, the initialization attempt is stopped before the gradient is even considered. The gradient is always checked, but only after a double evaluation returns successfully.

bbbales2 commented 3 years ago

It is throwing an error in log_prob_grad. Seems like it is printing the error twice. Not sure why that's happening.

I modified the initialize.hpp code and added some stuff around line 160:

    std::cout << "meow" << std::endl << std::flush;

    std::stringstream log_prob_msg;
    std::vector<double> gradient;
    auto start = std::chrono::steady_clock::now();
    try {
      // we evaluate this with propto=true since we're
      // evaluating with autodiff variables
      log_prob = stan::model::log_prob_grad<true, Jacobian>(
          model, unconstrained, disc_vector, gradient, &log_prob_msg);
    } catch (const std::exception& e) {
      if (log_prob_msg.str().length() > 0)
        logger.info(log_prob_msg);
      logger.info(e.what());
      throw;
    }

    std::cout << "woof" << std::endl << std::flush;

Output is:

Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
meow
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
~/cmdstan-develop$

betanalpha commented 3 years ago

You need to separate out loop iterations, for example by printing "num_init_tries". When "Error evaluating the log probability at the initial value." is passed to the logger the gradient evaluation isn't even attempted due to the "continue" statement on line 140. My guess is that the first attempt fails with a bad function evaluation and then on the second loop iteration the function evaluation passes but the gradient evaluation fails, which triggers the throw on line 172.

If this is the case then all we need to do is add a logger message before line 171 to clarify the failure of log_prob_grad:

      logger.info(
          "  Error evaluating the gradient of the log probability"
          " at the initial value.");

bbbales2 commented 3 years ago

We could throw a more informative error, but the code will behave more robustly if we just initialize where the gradients also work, so it's possible to start running HMC.

I believe that is the intention of this code (just find anywhere to start), and the difficulty of doing the ODE solves causes this problem.

betanalpha commented 3 years ago

I believe that is the intention of this code (just find anywhere to start), and the difficulty of doing the ODE solves causes this problem.

The code is very clearly laid out and is symmetric between double and var evaluations. If the evaluation completes successfully and returns a well-defined value then the initialization terminates successfully. If it completes successfully and returns an ill-defined value then it repeats. If it throws an error (other than a domain error, which only needs to be checked once since double and var evaluations have the same domain) then the initialization terminates unsuccessfully.

CVODES throws a non-domain exception when the integration of the sensitivity states fails, triggering the initialization to terminate as expected from the current implementation.

Changing the initialization logic to retry on var-evaluation exceptions requires a more careful discussion. I recommend closing this issue and opening a new issue on the proper topic.

bbbales2 commented 3 years ago

and is symmetric between double and var evaluations.

I don't know what symmetry means here. If you mean it's the same checks between the tries on lines 124 and 163, they are not.

The try on line 163 does not catch domain errors and continue running; any exception it catches, it re-throws.

CVODES throws a non-domain exception when the integration of the sensitivity states fails, triggering the initialization to terminate as expected from the current implementation.

It's a domain exception here.

You can change the code to:

auto start = std::chrono::steady_clock::now();
try {
  // we evaluate this with propto=true since we're
  // evaluating with autodiff variables
  log_prob = stan::model::log_prob_grad<true, Jacobian>(
      model, unconstrained, disc_vector, gradient, &log_prob_msg);
} catch (const std::domain_error& e) {
  std::cout << "It's a domain error" << std::endl;
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  throw;
} catch (const std::exception& e) {
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  throw;
}
auto end = std::chrono::steady_clock::now();

And you'll see in the output when this problem comes up:

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
It's a domain error
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)

Anyway, this is the problem that I want to fix. These are recoverable errors and we should be recovering from them in the same way as we do already.

We should pick an initialization for the sampler where it is possible to both evaluate the log density and the gradients. Right now we do not check that we are able to evaluate the gradients. I do not think that is correct considering we need the gradients to do the sampling. That is the change I want to make over here.
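
One way the change could look (a sketch only, not an actual patch): inside the initialization loop, treat a domain error from the gradient evaluation the same way a domain error from the double evaluation is treated, i.e. reject the draw and continue, while still re-throwing any other exception:

try {
  log_prob = stan::model::log_prob_grad<true, Jacobian>(
      model, unconstrained, disc_vector, gradient, &log_prob_msg);
} catch (const std::domain_error& e) {
  // Recoverable failure: reject this draw and move on to the next attempt,
  // mirroring how the double evaluation handles domain errors.
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info("Rejecting initial value:");
  logger.info(
      "  Error evaluating the gradient of the log probability"
      " at the initial value.");
  logger.info(e.what());
  continue;
} catch (const std::exception& e) {
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  throw;
}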

bbbales2 commented 3 years ago

@betanalpha

wds15 commented 3 years ago

Bump. @betanalpha this looks all good to me. The intent to error out when the chosen initial value leads to a NaN for the log-likelihood or its gradient is sensible, since the sampler needs both in order to move. I am happy to review the PR for this so that we get this into the release.

betanalpha commented 3 years ago

The intent to error out when the chosen initial value leads to a NaN for the log-likelihood or its gradient is sensible, since the sampler needs both in order to move.

Yes, it is, which is why the code has always checked both.

Right now we do not check that we are able to evaluate the gradients. I do not think that is correct considering we need the gradients to do the sampling.

Yes, we do.

That is the change I want to make over here.

The description of this PR is incorrect.

Let me try to run through the code flow one more time:

For each of the 100 initialization attempts we check:

  1. That the initial values are consistent with the parameter constraints.

     If the failure is due to a domain error on the init var_context then a new attempt is made (L 149).

     Otherwise the initialization fails immediately.

  2. That we can evaluate log_prob for doubles at the initial value.

     If the failure is due to a domain error then a new attempt is made (L 140).

     Otherwise the initialization fails immediately.

  3. That the returned log_prob is finite.

     If it’s not finite then a new attempt is made (L 158).

  4. That we can evaluate log_prob_grad for vars at the initial values.

     If there is any failure in the evaluation of log_prob_grad then the initialization fails immediately.

  5. That the Euclidean norm of the gradient is finite.

     If it’s not finite then the initialization fails when it reaches the end of the loop body.

  6. If all checks pass then the function returns before the end of the loop body (L 211).

Initialization returns successfully only if both the value and gradient of the log target density can be evaluated.

All of this happens in one loop and, again, the double and var evaluations are checked for each attempt.
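
(Restating that flow as a compact, self-contained toy: stub evaluations and made-up outcomes, not the actual initialize.hpp code. Check 1, the constraint check on the initial values, is omitted for brevity.)

#include <cmath>
#include <iostream>
#include <stdexcept>
#include <vector>

// Stubs standing in for the double and var evaluations of log_prob.
double log_prob_double(int attempt) {
  if (attempt == 0)
    throw std::domain_error("ode_bdf_tol: failed during the double evaluation");
  return -12.3;
}
double log_prob_grad_var(std::vector<double>& gradient) {
  gradient.assign(2, 0.5);
  // The var pass also integrates the sensitivity system, so it can fail
  // even when the double pass for the same draw succeeded.
  throw std::domain_error("ode_bdf_tol: failed during the var evaluation");
}

int main() {
  const int MAX_INIT_TRIES = 100;
  for (int attempt = 0; attempt < MAX_INIT_TRIES; ++attempt) {
    double lp;
    try {                                    // check 2: double evaluation
      lp = log_prob_double(attempt);
    } catch (const std::domain_error& e) {
      std::cout << "Rejecting initial value:\n  Error evaluating the log"
                   " probability at the initial value.\n" << e.what() << "\n";
      continue;                              // domain error: retry with a new draw
    }
    if (!std::isfinite(lp)) continue;        // check 3: non-finite value, retry

    std::vector<double> gradient;
    try {                                    // check 4: var (gradient) evaluation
      lp = log_prob_grad_var(gradient);
    } catch (const std::exception& e) {
      std::cout << e.what() << "\n";
      return 1;                              // any exception here ends initialization
    }
    double norm_sq = 0;                      // check 5: require a finite gradient norm
    for (double g : gradient) norm_sq += g * g;
    if (!std::isfinite(std::sqrt(norm_sq))) return 1;

    std::cout << "initialized on attempt " << attempt << "\n";
    return 0;                                // check 6: all checks passed, done
  }
  std::cout << "Initialization failed.\n";
  return 1;
}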

There are certainly a few changes that can be considered.

  1. Adding a logger message, beyond the log_prob message (L 170) and the exception message (L 171), when the gradient evaluation throws an exception (see the sketch after this list). Pull #3035 does not address this but instead tries to completely change the initialization evaluation logic.

  2. Attempting new initializations under some gradient evaluation failures, namely domain errors. As I said before, given that the behavior of CVODES errors was unforeseen when the initialization code was written, this is reasonable but goes beyond the scope of this issue.
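
A sketch of change 1 (not an actual patch; it just splices the message suggested earlier in the thread into the existing catch around log_prob_grad):

try {
  log_prob = stan::model::log_prob_grad<true, Jacobian>(
      model, unconstrained, disc_vector, gradient, &log_prob_msg);
} catch (const std::exception& e) {
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(
      "  Error evaluating the gradient of the log probability"
      " at the initial value.");
  logger.info(e.what());
  throw;
}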