Seg Fault when summing kernels in gp_pred_rng

stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.

https://mc-stan.org

BSD 3-Clause "New" or "Revised" License

Seg Fault when summing kernels in gp_pred_rng #994

Closed drezap closed 6 years ago

drezap commented 6 years ago

Description

Whenever we try to sum kernels in a gp_pred_rng, we get a segmentation fault.

I've tried only using the predictive mean, and still, we get a seg fault.

Example

Here's an example of the kind of gp_pred_rng that will give a seg fault. I'm also attaching full Stan code and datasets, where if we sum kernels in the pred_rng, we experience the same issue every time:

  vector gp_pred_rng(vector[] x_pred,
                     vector y1, vector[] x,
                     real magnitude, real[] length_scale,
                     real sig,
                     real sigma) {
    int N = rows(y1);
    int N_pred = size(x_pred);
    vector[N_pred] f2;
    {
      matrix[N, N] K = add_diag(gp_dot_prod_cov(x, sig) +
                                gp_exp_quad_cov(x, magnitude, length_scale),
                                sigma);
      matrix[N, N] L_K = cholesky_decompose(K);
      vector[N] L_K_div_y1 = mdivide_left_tri_low(L_K, y1);
      vector[N] K_div_y1 = mdivide_right_tri_low(L_K_div_y1', L_K)';
      matrix[N, N_pred] k_x_x_pred = gp_dot_prod_cov(x, x_pred, sig) +
        gp_exp_quad_cov(x, x_pred, magnitude, length_scale);
      f2 = (k_x_x_pred' * K_div_y1);
    }
    return f2;
  }
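
For reference, the quantity this function returns can be written out as below. This is only a reading of the Stan code above, not something stated in the issue: K_dot and K_eq are shorthand introduced here for the matrices returned by gp_dot_prod_cov and gp_exp_quad_cov, and K_* stands for k_x_x_pred.

```latex
\begin{aligned}
K &= K_{\text{dot}}(x, x) + K_{\text{eq}}(x, x) + \sigma I,
  \qquad K = L_K L_K^{\top}, \\
K_{*} &= K_{\text{dot}}(x, x_{\text{pred}}) + K_{\text{eq}}(x, x_{\text{pred}}), \\
f_2 &= K_{*}^{\top} K^{-1} y_1
     = K_{*}^{\top} L_K^{-\top} L_K^{-1} y_1 .
\end{aligned}
```

The two triangular solves in the code compute L_K^{-1} y1 and then L_K^{-T} (L_K^{-1} y1), so only the predictive mean is returned and no random draw is made.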

Additional Information

There are no seg faults when I sum kernels in the transformed parameters block and let the kernels go out of scope. They occur only when I sum kernels for the posterior predictive distribution in the functions block.

These datasets are relatively small, so there should not be a segmentation fault in these cases.

gp_regression.txt housing.txt logistic_gp.txt heart-disease-uci.zip

Current Math Version

v2.18.0

bob-carpenter commented 6 years ago

Data size can cause out of memory errors, but it shouldn't lead to a segfault.

Could you create a complete end-to-end example that fails in the way you're not expecting? It's not possible to call that _rng function in the transformed parameters block, so there must be a different setup when there aren't seg faults in transformed parameters.

Is there a problem if you assign to two matrices then add?

matrix[N, N_pred] a = gp_dot_prod_cov(x, x_pred, sig);
matrix[N, N_pred] b = gp_exp_quad_cov(x, x_pred, magnitude, length_scale);
matrix[N, N_pred] ab = a + b;

P.S. When opening an issue please label it with deliverable and bug status and ideally with an assignee or a tag that it's a good first issue.

drezap commented 6 years ago

Thanks for the detailed response, I'll be sure to include all this when I open an issue next time.

A few things to note: This is not big data, N=303. Any time I sum kernels in the functions block, even with smaller data, I get this segfault.

1) I'm attaching one end-to-end example that gives the segfault: logistic_gp_segfault.txt heart_disease_classification.data.R.txt. It can be run in CmdStan with

  ./logistic_gp_segfault sample num_samples=200 num_warmup=200 data file=heart_disease_classification.data.R

I changed the extensions to .txt so I could upload the files directly. The first file is the Stan code, so its extension needs to be changed from .txt to .stan.

> Is there a problem if you assign to two matrices then add?

Yep, here's another Stan program where I get a segfault with this strategy. It runs with the same data and a similar command as above. logistic_gp_segfault_bob.txt

> P.S. When opening an issue please label it with deliverable and bug status and ideally with an assignee or a tag that it's a good first issue.

Will do! Thanks for the workflow feedback.

bob-carpenter commented 6 years ago

Thanks. Stan shouldn't be segfaulting no matter what happens. So we need to get to the bottom of this.

I updated everything to develop, but I can't compile that file:

No matches for: 

  gp_dot_prod_cov(vector[], real)

Function gp_dot_prod_cov not found.
  error in '/Users/carp/temp2/drezap/gp-segfault-2.stan' at line 10, column 48
  -------------------------------------------------
     8:     vector[N_pred] f2;
     9:     {
    10:       matrix[N, N] k1 = gp_dot_prod_cov(x, sig);
                                                       ^
    11:       matrix[N, N] k2 = gp_exp_quad_cov(x, magnitude, length_scale);
  -------------------------------------------------

Are you working on a branch or something?

I'd suggest trying to debug what's going into those matrices k1 and k2 --- when you're done, are they actually N x N? You should be able to print them out or print out elements that aren't finite (using is_inf() and is_nan()). Multiplication itself shouldn't be causing a problem.

The other thing to do is go in and instrument the generated .hpp with std::cout directed print statements with std::endl to flush---then you can diagnose where the segfault arises.

You might also be able to detect some kind of problem with the types looking at the generated code. What's the return type for your functions?

drezap commented 6 years ago

I'm working on a hacked local cmdstan-dev. To get this to work, you need to add these lines to the function signatures:

add("gp_dot_prod_cov", expr_type(matrix_type()), expr_type(double_type(), 1U), expr_type(double_type()));
add("gp_dot_prod_cov", expr_type(matrix_type()), expr_type(vector_type(), 1U), expr_type(double_type()));
add("gp_dot_prod_cov", expr_type(matrix_type()), expr_type(row_vector_type(), 1U), expr_type(double_type()));
// x1, x2
add("gp_dot_prod_cov", expr_type(matrix_type()), expr_type(double_type(), 1U), expr_type(double_type(), 1U), expr_type(double_type()));
add("gp_dot_prod_cov", expr_type(matrix_type()), expr_type(vector_type(), 1U), expr_type(vector_type(), 1U), expr_type(double_type()));

and then make the changes as in PR #980 (just copy and paste the file).

I'll take your suggestions as to how to debug when I get the chance, much appreciated.

drezap commented 6 years ago

Or likewise, just sum two gp_exp_quad_cov kernels, and you'll get the same issue. I'll figure it out when I put some time into it.

drezap commented 6 years ago

I traced this to hmc_nuts_diag_e_adapt.hpp, where we have a recursive function with no terminating condition that is probably causing a stack overflow.

I'm taking a look at Hamiltonian Monte Carlo for Hierarchical Models and The No-U-Turn Sampler to see how the adaptive parameters are calculated, but these are likely outdated.

Can someone point me to a paper that describes the current state of the epsilon adaptation? (easier to read math and papers than chase the C++ around). I want to at least propose a solution, as it will help my understanding of HMC/NUTS.

bbbales2 commented 6 years ago

An exploding recursion should be easy to verify before digging into the mechanics. Can you run this in a debugger (build everything with -Og), run it in gdb or lldb, and then check the backtrace?

I'll be in Helsinki tomorrow afternoon. We can look at this then if you've got other stuff to do.

drezap commented 6 years ago

This has absolutely nothing to do with what I said above. I just went out of bounds when indexing a vector. I'm addressing this in #980, where it's taken care of by vectorizing the error checks.
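
To illustrate why an out-of-range index can surface as a segfault, and how checking a container up front avoids it, here is a small C++ sketch. It is not the code from #980. The names check_size_at_least and sum_first are hypothetical, and the check shown is only one assumed way to validate sizes before a loop.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical up-front check: validates the whole index range once,
// instead of relying on unchecked operator[] inside the loop.
void check_size_at_least(const std::vector<double>& v, std::size_t needed,
                         const std::string& name) {
  if (v.size() < needed) {
    throw std::out_of_range(name + " has too few elements");
  }
}

// Sums the first n entries of v. Without the check, reading v[i] for
// i >= v.size() is undefined behavior and may crash with a segfault.
double sum_first(const std::vector<double>& v, std::size_t n) {
  check_size_at_least(v, n, "v");
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    total += v[i];
  }
  return total;
}
```

With the check in place, a bad index turns into an exception carrying a readable message rather than a crash, which is the behavior the earlier comments ask for ("Stan shouldn't be segfaulting no matter what happens").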