stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

`ranef` function for accumulating random effects without getting temporaries #2237

Open bbbales2 opened 3 years ago

bbbales2 commented 3 years ago

Description

For a linear model that looks like:

vector mu = X * b + x1 .* r1[i1] + x2 .* r2[i2] + x3 .* r3[i3] + x4 .* r4[i4] + ...

we can get a speed boost by doing all the random effects terms together and avoiding temporaries from the multi-indexes (r1[i1]), the elementwise multiplications, and the additions.
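For intuition, the naive evaluation materializes a fresh vector for every term before the next operation runs. Here is a schematic of just the multi-index temporaries in plain Eigen (my own illustration, not Stan's actual generated code, and ignoring autodiff):

#include <Eigen/Dense>
#include <vector>

// Schematic of the naive evaluation: each multi-index r[i] builds a
// temporary vector of length N, and the sum builds yet another one.
Eigen::VectorXd naive_two_terms(const Eigen::VectorXd& r1, const std::vector<int>& i1,
                                const Eigen::VectorXd& r2, const std::vector<int>& i2) {
  Eigen::VectorXd t1(i1.size()), t2(i2.size());
  for (size_t n = 0; n < i1.size(); ++n) t1(n) = r1(i1[n] - 1);  // temporary for r1[i1]
  for (size_t n = 0; n < i2.size(); ++n) t2(n) = r2(i2[n] - 1);  // temporary for r2[i2]
  return t1 + t2;  // and one more for the sum
}

A fused version does one pass over the indices and writes straight into the result.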

With this model: https://github.com/bbbales2/computations_in_glms/blob/master/benchmark_model/mrp_varmat.stan

Replacing the code:

  vector[N] mu_varmat = Intercept + Xc * b_varmat
    + r_age_varmat[J_age]
    + r_educ_varmat[J_educ]
    + r_educ_age_varmat[J_educ_age]
    + r_educ_eth_varmat[J_educ_eth]
    + r_eth_varmat[J_eth]
    + r_male_eth_varmat[J_male_eth]
    + r_state_varmat[J_state];

with:

vector[N] mu_varmat = Intercept + Xc * b_varmat +
    ranef(J_age, r_age_varmat, J_educ, r_educ_varmat, J_educ_age, r_educ_age_varmat, J_educ_eth, r_educ_eth_varmat, J_eth, r_eth_varmat, J_male_eth, r_male_eth_varmat, J_state, r_state_varmat);

makes the per-gradient varmat timings go from about 9 ms on my computer to about 7 ms when using the mrp_all.json dataset in that repo. There's a discussion over here of a fancier ranef function.

A very rough C++ implementation of ranef:

template <typename T,
          require_var_matrix_t<T>* = nullptr>
inline T ranef(const std::vector<int>& i1, const T& v1,
               const std::vector<int>& i2, const T& v2,
               const std::vector<int>& i3, const T& v3,
               const std::vector<int>& i4, const T& v4,
               const std::vector<int>& i5, const T& v5,
               const std::vector<int>& i6, const T& v6,
               const std::vector<int>& i7, const T& v7,
               std::ostream*) {
  // (rough sketch -- a full implementation would check all seven index vectors)
  check_matching_sizes("ranef", "i1", i1, "i2", i2);

  typename T::value_type res_val(i1.size());

  // copies of the indices that survive into the reverse pass
  arena_t<std::vector<int>> ai1(i1.size());
  arena_t<std::vector<int>> ai2(i2.size());
  arena_t<std::vector<int>> ai3(i3.size());
  arena_t<std::vector<int>> ai4(i4.size());
  arena_t<std::vector<int>> ai5(i5.size());
  arena_t<std::vector<int>> ai6(i6.size());
  arena_t<std::vector<int>> ai7(i7.size());

  // forward pass: accumulate all seven indexed terms in a single loop,
  // with no intermediate vectors
  for (size_t n = 0; n < i1.size(); ++n) {
    res_val.coeffRef(n) = v1.val().coeff(i1[n] - 1) +
      v2.val().coeff(i2[n] - 1) +
      v3.val().coeff(i3[n] - 1) +
      v4.val().coeff(i4[n] - 1) +
      v5.val().coeff(i5[n] - 1) +
      v6.val().coeff(i6[n] - 1) +
      v7.val().coeff(i7[n] - 1);
    ai1[n] = i1[n];
    ai2[n] = i2[n];
    ai3[n] = i3[n];
    ai4[n] = i4[n];
    ai5[n] = i5[n];
    ai6[n] = i6[n];
    ai7[n] = i7[n];
  }

  T res = res_val;

  // reverse pass: scatter the result adjoint back onto each random effect
  reverse_pass_callback([v1, ai1,
                         v2, ai2,
                         v3, ai3,
                         v4, ai4,
                         v5, ai5,
                         v6, ai6,
                         v7, ai7, res]() mutable {
    for (size_t n = 0; n < ai1.size(); ++n) {
      double adj = res.adj().coeff(n);
      v1.adj().coeffRef(ai1[n] - 1) += adj;
      v2.adj().coeffRef(ai2[n] - 1) += adj;
      v3.adj().coeffRef(ai3[n] - 1) += adj;
      v4.adj().coeffRef(ai4[n] - 1) += adj;
      v5.adj().coeffRef(ai5[n] - 1) += adj;
      v6.adj().coeffRef(ai6[n] - 1) += adj;
      v7.adj().coeffRef(ai7[n] - 1) += adj;
    }
  });

  return res;
}

@t4c1 should we implement a ranef function like this? Or can we achieve the same thing transparently with expressions? Will the answer be different for mat<var> and var<mat>?

Current Version:

v3.4.0

SteveBronder commented 3 years ago

How would we generalize this? One issue I see here is that we would need either a lot of explicit functions or some clever tuple-and-parameter-pack scheme. I think an alternative approach here would be to allow sparse data into the language. Then all this sorting would just be done in the sparse matrix, and the actual function itself would stay the same.

data {
  int N;
  int M;
  int K;
  int[K] nonzero_rows;
  int[K] nonzero_cols;
  sparse_matrix[N, M, nonzero_rows, nonzero_cols] X_onehot;
}
parameters {
  vector[M] group_alpha;
}
transformed parameters {
  vector[N] mu = X_onehot * group_alpha;
}

wds15 commented 3 years ago

I always thought this is why we want efficient sparse matrix support. The looping is just a hack, and sparse matrices are the much better alternative here.

bbbales2 commented 3 years ago

> How would we generalize this?

Well, there are a lot of options; that's part of the question. We could do something like the hierarchy syntax here, or try to get some custom syntax.

Maybe we could get around new user-facing functions or syntax by trying to automatically recognize terms with the compiler (and just forward compatible things to a special function) or maybe we could do it all with expressions (which is what I was asking @t4c1 about).

> an alternative approach here would be to allow sparse data into the language

> The looping is just a hack and sparse matrix is the much better alternative here.

I don't see these things as competing and definitely don't think of either thing as a hack. I like having my variables broken out explicitly and not packed into one big vector. I have also heard from someone who prefers doing their random effects with the sparse matrix stuff that is currently in Stan.

As far as Stan's use as an intermediate language, those use cases can do whatever is faster. But as far as Stan models used directly, both methods have different merits.

As far as performance, they're doing very similar things, and will benefit from similar optimizations. I vaguely remember that the current sparse implementation is regarded as inefficient because it is full of checks and stuff (implementation here).

To get a handle on this, I wrote and benchmarked the above model with a sparse matrix multiply that doesn't do any checks. Here is the code I used; the signature is meant to match the compressed-row-storage multiply currently in Stan:

template <typename E, typename T,
          require_eigen_t<E>* = nullptr,
          require_var_matrix_t<T>* = nullptr>
inline T csr_matrix_times_vector_varmat(int N,
               int L,
               const E& w,
               const std::vector<int>& v,
               const std::vector<int>& u,
               const T& b,
               std::ostream*) {
  typename T::value_type res_val(N);

  // copies of the CSR data that survive into the reverse pass
  arena_t<E> aw(w.size());
  arena_t<std::vector<int>> av(v.size());
  arena_t<std::vector<int>> au(u.size());

  // forward pass: row-by-row sparse dot products, no bounds checks
  size_t idx = 0;
  for (int n = 0; n < N; ++n) {
    double val = 0.0;
    size_t row_nnz = u[n + 1] - u[n];  // nonzeros in this row
    for (size_t l = 0; l < row_nnz; ++l) {
      val += w.coeff(idx) * b.val().coeff(v[idx] - 1);
      aw.coeffRef(idx) = w.coeff(idx);
      av[idx] = v[idx];
      idx++;
    }
    au[n] = u[n];
    res_val.coeffRef(n) = val;
  }
  au[au.size() - 1] = u[u.size() - 1];

  T res = res_val;

  // reverse pass: scatter each row's adjoint through the stored weights
  reverse_pass_callback([N, aw, av, au, b, res]() mutable {
    size_t idx = 0;
    for (int n = 0; n < N; ++n) {
      double adj = res.adj().coeff(n);
      size_t row_nnz = au[n + 1] - au[n];
      for (size_t l = 0; l < row_nnz; ++l) {
        b.adj().coeffRef(av[idx] - 1) += aw.coeff(idx) * adj;
        idx++;
      }
    }
  });

  return res;
}

I ran 100 warmup iterations and 100 post-warmup draws of versions of the mrp model over here. There is a version of the model there, mrp_sparse.stan, that uses a sparse matrix-vector multiply. There's code at the end for converting the random effects data to sparse matrix data.

Using the mrp_all.json data, a varmat implementation was about 9.2 ms per gradient, the ranef implementation was 6.7 ms per gradient, and a sparse matrix implementation (different from the model in the repo -- built using the code above with varmats) was 7.9 ms per gradient.

The sparse matrix version is presumably slower here because it shuffles an extra variable through memory (the w vector -- since these are random effects with no intercepts, the ranef implementation avoids it) and its dot product is a plain loop rather than the fully unrolled code in ranef.

Both codes would get faster if data did not need to be saved into the arena for the reverse-mode pass. Presumably this could be handled in both cases by storing variables from data and transformed data in arena variables, so the code knows they'll be available on the reverse pass without copying.

With a sparse matrix type we could probably avoid ever doing bounds checks after the data is read in. With ranef I think we're always bounds checking.

In both cases, for additional performance we might consider lumping these things into functions like the _glm functions currently in Stan, where we compute gradients on the forward pass instead of the reverse pass (and by doing that avoid having to read through the data twice or worry about saving it to a stack).

Anyway, that's a lot of stuff to say that I think these methods are complementary, so I don't think sparse matrices make this approach obsolete or anything.

Here is code for building the sparse versions of the data from the non-sparse versions:

make_sparse_data = function(data) {
  L = data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$N_eth + data$N_male_eth + data$N_state
  NNZ = 7 * data$N
  w = rep(1.0, NNZ)
  v = c()
  u = rep(0, data$N + 1)
  for(n in 1:data$N) {
    v = c(v, c(data$J_age[n],
            data$N_age + data$J_educ[n],
            data$N_age + data$N_educ + data$J_educ_age[n],
            data$N_age + data$N_educ + data$N_educ_age + data$J_educ_eth[n],
            data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$J_eth[n],
            data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$N_eth + data$J_male_eth[n],
            data$N_age + data$N_educ + data$N_educ_age + data$N_educ_eth + data$N_eth + data$N_male_eth + data$J_state[n]))
    u[n] = 7 * (n - 1) + 1
  }
  u[data$N + 1] = NNZ + 1

  list(N = data$N,
       y = data$y,

       K = data$K,
       X = data$X,

       L = L,

       NNZ = NNZ,
       w = w,
       v = v,
       u = u,

       N_age = data$N_age,
       N_educ = data$N_educ,
       N_educ_age = data$N_educ_age,
       N_educ_eth = data$N_educ_eth,
       N_eth = data$N_eth,
       N_male_eth = data$N_male_eth,
       N_state = data$N_state,

       prior_scale_sd = data$prior_scale_sd,
       prior_scale_Intercept = data$prior_scale_Intercept,
       prior_scale_b = data$prior_scale_b)
}

Edit: Oh yeah here's the Eigen sparse matrix vector multiply for comparison: https://github.com/libigl/eigen/blob/master/Eigen/src/SparseCore/SparseDenseProduct.h#L95

There's a lot less fancy stuff there. The random memory access makes it hard to do anything clever.

andrjohns commented 3 years ago

I like this idea, so I put together a rough implementation for a generalised approach: https://godbolt.org/z/foP8xY

It's not a reverse-mode specialisation like Ben's above, but it's more flexible in both the number of inputs and the function that can be applied. I've instead made an indexed_apply functor which will apply a specified function to the inputs at the given indexes.

So let's say you have four vectors, each with an std::vector of indexes. The index vectors just need to be packed into an std::vector, and a functor for summing the arguments specified:

Eigen::VectorXd result = indexed_apply(add_fun, {inds1, inds2, inds3, inds4}, vec1, vec2, vec3, vec4);
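The godbolt link has the actual implementation; for a sense of the shape, a minimal sketch of the idea for plain double vectors (names and details here are mine, not necessarily Andrew's) could look like:

#include <Eigen/Dense>
#include <utility>
#include <vector>

// Pair the Is-th input vector with the Is-th index set via an index
// sequence, then apply f elementwise across the gathered entries.
template <typename F, std::size_t... Is, typename... Vecs>
Eigen::VectorXd indexed_apply_impl(F&& f,
                                   const std::vector<std::vector<int>>& idxs,
                                   std::index_sequence<Is...>,
                                   const Vecs&... vecs) {
  const std::size_t N = idxs[0].size();
  Eigen::VectorXd out(N);
  for (std::size_t n = 0; n < N; ++n)
    out(n) = f(vecs(idxs[Is][n] - 1)...);  // packs expand in lockstep (1-based indices)
  return out;
}

template <typename F, typename... Vecs>
Eigen::VectorXd indexed_apply(F&& f,
                              const std::vector<std::vector<int>>& idxs,
                              const Vecs&... vecs) {
  return indexed_apply_impl(std::forward<F>(f), idxs,
                            std::index_sequence_for<Vecs...>{}, vecs...);
}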

Using this approach, the ranef function can just be a wrapper that internally specifies the addition functor, so the user would just call:

vector[N] mu_varmat = Intercept + Xc * b_varmat +
    ranef({J_age, J_educ, J_educ_age, J_educ_eth},
          r_age_varmat, r_educ_varmat, r_educ_age_varmat, r_educ_eth_varmat);

EDIT: here's an example showing a ranef implementation in use: https://godbolt.org/z/KzenTv

t4c1 commented 3 years ago

@bbbales2 I think your implementation in the top post is the best we can do performance-wise. We just need to add variadic templates to accept arbitrarily many (values, indices) pairs. Personally I would prefer values to come before indices in each pair. Also, I like the name hierarchy better than ranef. If you have any problems implementing the variadic templates version, I can take over.
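Ignoring the reverse-mode plumbing, the variadic recursion over (values, indices) pairs could be shaped like this (a sketch over plain Eigen vectors, with values first as suggested; not the final implementation):

#include <Eigen/Dense>
#include <vector>

inline void ranef_accum(Eigen::VectorXd&) {}  // base case: no pairs left

// Peel one (values, indices) pair off the pack, accumulate, recurse.
template <typename V, typename... Pairs>
inline void ranef_accum(Eigen::VectorXd& acc, const V& v,
                        const std::vector<int>& idx, const Pairs&... rest) {
  for (size_t n = 0; n < idx.size(); ++n)
    acc(n) += v(idx[n] - 1);  // 1-based Stan indices
  ranef_accum(acc, rest...);
}

template <typename... Pairs>
inline Eigen::VectorXd ranef(size_t N, const Pairs&... pairs) {
  Eigen::VectorXd acc = Eigen::VectorXd::Zero(N);
  ranef_accum(acc, pairs...);
  return acc;
}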

t4c1 commented 3 years ago

We cannot use expressions here, as the Eigen version we are using does not support indexing with vectors yet. Even if we used a version that does support it, I don't think it would be any faster, as indexing makes memory accesses non-contiguous, preventing vectorization.
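For reference, Eigen 3.4's indexed views do support this kind of indexing, so the expression version might look roughly like the sketch below (with 0-based indices), though as noted the gather would still leave the memory accesses non-contiguous:

#include <Eigen/Dense>
#include <vector>

// Requires Eigen >= 3.4: v(idx) forms an indexed view over the vector.
// The sum is still a gather, so it would not vectorize well.
Eigen::VectorXd indexed_sum(const Eigen::VectorXd& v1, const std::vector<int>& i1,
                            const Eigen::VectorXd& v2, const std::vector<int>& i2) {
  return v1(i1) + v2(i2);  // i1, i2 must already be 0-based here
}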

SteveBronder commented 3 years ago

@bbbales2 how did you get the per gradient timings in the model?

bbbales2 commented 3 years ago

@andrjohns that is neat that indexed_apply can do this.

@t4c1 bummer on expressions. I guess we'd have to do something in the compiler to recognize the pattern, add extra syntax, or just have a plain-ol' giant function.

@SteveBronder some variation of:

f1 = read_stan_csv("~/cmdstan-develop/output1.all.csv")
get_elapsed_time(f1)[, "sample"] / sum(get_num_leapfrog_per_iteration(f1))

You have to run it a while to avoid timing the overhead. In this case I was timing 100 post-warmup draws of the mrp_all model.

bbbales2 commented 3 years ago

Also to be clear, this isn't something I want to try to work out for 2.26 or anything. Just something I wanted to benchmark and revisit since the varmat stuff is coming alive.