stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org

Missed optimizations in precomputed gradients and operators for vari #1448

Open SteveBronder opened 5 years ago

SteveBronder commented 5 years ago

Description

I ran GCC's missed-optimization report on the arK example in stat_comp_benchmarks. The gist below contains all the Stan Math related optimization misses:

https://gist.github.com/SteveBronder/eeb870bd25e9cb5c10d6e7bff44d24bb

In particular, optimization misses like the ones below were interesting to me:

stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not suitable for gather load _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: bad data references.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:73:33: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5 = *_4;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _10 = *_9;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _6 = _5->adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _5->adj_ = _12;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not consecutive access _7 = this_16(D)->D.202934.adj_;
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:72:26: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/precomputed_gradients.hpp:75:3: note: not vectorized: not enough data-refs in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _8 = _1->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _1->adj_ = _10;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _5 = _2->adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _2->adj_ = _7;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _6 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not consecutive access _9 = this_13(D)->D.211540.D.211444.adj_;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:24:18: note: not vectorized: no grouped stores in basic block.
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _1->adj_ =  Nan;
stan/lib/stan_math/stan/math/rev/core/operator_addition.hpp:21:18: note: not consecutive access _2->adj_ =  Nan;

These operators tend to be defined pretty similarly; below is the op for addition:

namespace internal {
class add_vv_vari : public op_vv_vari {
 public:
  add_vv_vari(vari* avi, vari* bvi)
      : op_vv_vari(avi->val_ + bvi->val_, avi, bvi) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bvi_->val_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
      bvi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
      bvi_->adj_ += adj_;
    }
  }
};

class add_vd_vari : public op_vd_vari {
 public:
  add_vd_vari(vari* avi, double b) : op_vd_vari(avi->val_ + b, avi, b) {}
  void chain() {
    if (unlikely(is_any_nan(avi_->val_, bd_))) {
      avi_->adj_ = std::numeric_limits<double>::quiet_NaN();
    } else {
      avi_->adj_ += adj_;
    }
  }
};
}  // namespace internal

The main miss we have in the above is that vari stores its adj_ and val_ next to each other on the stack allocator, when in chain() we really just need access to the adjoint values. This is a known issue (Bob has a very nice post related to this on Discourse), but I wonder if there is a dumber approach that could resolve this particular issue.

What if we added a template parameter to stack_alloc (or maybe vari?) that defined the storage scheme we want? Say we defined enums joint, val, and adj, where joint places val_ and adj_ together on the same stack, while val/adj place val_ on one stack and adj_ on another. Would that help for chain() methods like the one in precomputed_gradients_vari? It feels like it would give us better memory locality, and if we documented it well I don't think this style would be that bad. We could default the template value to joint so that

  void chain() {
    for (size_t i = 0; i < size_; ++i) {
      varis_[i]->adj_ += adj_ * gradients_[i];
    }
  }

would get better memory locality and hit these vectorization opportunities with SIMD instructions.
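To make the locality point concrete, here is a hedged sketch (toy types, not Stan's actual vari or stack_alloc) contrasting the current pointer-chasing loop with the kind of unit-stride loop a split adj_ stack could enable:

```cpp
#include <cstddef>

// Toy stand-in for vari: val_ and adj_ interleaved per node (the
// current "joint" layout), reached through an array of pointers.
struct toy_vari {
  double val_;
  double adj_;
};

// Current style: a gather through varis[i]->adj_, which the GCC log
// above flags as "not suitable for gather load" / "not consecutive".
void chain_joint(toy_vari** varis, const double* gradients, double adj,
                 std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    varis[i]->adj_ += adj * gradients[i];
  }
}

// Hypothetical "val/adj" layout: the operands' adjoints sit
// contiguously on their own stack, so the loop body is a unit-stride
// multiply-add that the vectorizer can handle.
void chain_split(double* adj_stack, const double* gradients, double adj,
                 std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    adj_stack[i] += adj * gradients[i];
  }
}
```

Caveat: chain_split only applies when an operation's operand adjoints actually land contiguously on the adj stack; otherwise the loop still gathers through indices, just without dragging the unused val_ values through the cache alongside.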

Expected Output

better memory locality

Current Version:

v3.0.0

bob-carpenter commented 5 years ago

Cool. How did you run the optimization checker?

Templates don't play nicely with OO, so I don't see how we could add a template parameter to the stack_alloc class. It can't go all the way to CRTP because something needs virtual dispatch in the reverse pass.

Also, there's no way to handle container storage that still works for getting and setting elements; you wind up having to rewrite the entire object every time there's a set operation.
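For readers following along, a minimal sketch (toy classes, not Stan's) of the virtual-dispatch constraint: the reverse pass walks one heterogeneous stack of nodes through a common base pointer, so chain() has to be virtual; a CRTP base templated on the derived type would give every node type a different base and break that uniform traversal.

```cpp
#include <vector>

// Toy model of the autodiff stack: every node type shares one
// virtual base so the reverse sweep can visit them uniformly.
struct toy_node {
  double adj_ = 0.0;
  virtual void chain() {}
  virtual ~toy_node() = default;
};

struct toy_add : toy_node {
  toy_node* a_;
  toy_node* b_;
  toy_add(toy_node* a, toy_node* b) : a_(a), b_(b) {}
  void chain() override {
    a_->adj_ += adj_;  // propagate this node's adjoint to both operands
    b_->adj_ += adj_;
  }
};

// The reverse sweep: newest node first, one virtual call per node.
// Under CRTP, toy_add and other ops would no longer share a common
// base type, so a single vector like this could not hold them all.
inline void toy_grad(std::vector<toy_node*>& stack) {
  for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
    (*it)->chain();
  }
}
```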

SteveBronder commented 5 years ago

Cool. How did you run the optimization checker?

For GCC I add -fopt-info-missed='missed.log' to CXXFLAGS and it prints all the misses to a little file. After that I just grep'd for the Stan Math specific ones.

But yeah, I need to reread your article from above and think about this more.

Templates don't play nicely with OO, so I don't see how we could add a template parameter to the stack_alloc class. It can't go all the way to CRTP because something needs virtual dispatch in the reverse pass.

I'll think about this; I swear I have done this before, when I wanted to make our stack allocator compliant with the C++20 allocator concept.