Open SteveBronder opened 5 years ago
Cool. How did you run the optimization checker?
Templates don't play nicely with OO, so I don't see how we could add a template parameter to the stack_alloc class. It can't go all the way to CRTP because something needs virtual dispatch in the reverse pass.
Also, there's no way to deal with the container storage that works with getting/setting elements. You wind up having to rewrite the entire object every time there's a set operation.
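To make the virtual-dispatch point concrete, here is a minimal sketch of a reverse pass over a tape of mixed node types. All names here (`node`, `add_node`, `mul_node`, `tape`) are hypothetical stand-ins, not Stan Math's actual classes. The tape only knows the base type, so the reverse sweep has to reach each node's `chain()` through a virtual call.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for an autodiff node; NOT Stan Math's actual vari.
struct node {
  double val;
  double adj = 0.0;
  explicit node(double v) : val(v) {}
  virtual ~node() = default;
  // Push this node's adjoint back to its operands. Leaves have none.
  virtual void chain() {}
};

struct add_node : node {
  node* a;
  node* b;
  add_node(node* a_, node* b_) : node(a_->val + b_->val), a(a_), b(b_) {}
  void chain() override {
    a->adj += adj;
    b->adj += adj;
  }
};

struct mul_node : node {
  node* a;
  node* b;
  mul_node(node* a_, node* b_) : node(a_->val * b_->val), a(a_), b(b_) {}
  void chain() override {
    a->adj += adj * b->val;
    b->adj += adj * a->val;
  }
};

// The tape stores nodes of different concrete types in creation order,
// so the static type of each entry is erased to node.
struct tape {
  std::vector<std::unique_ptr<node>> nodes;

  template <typename T, typename... Args>
  T* make(Args... args) {
    nodes.push_back(std::make_unique<T>(args...));
    return static_cast<T*>(nodes.back().get());
  }

  // Reverse sweep: seed the output, then call chain() newest to oldest.
  void grad(node* out) {
    out->adj = 1.0;
    for (auto it = nodes.rbegin(); it != nodes.rend(); ++it) {
      (*it)->chain();  // virtual call: concrete type is unknown here
    }
  }
};
```

With `f = x * y + x` built on such a tape, `grad(f)` leaves `y + 1` in `x->adj` and `x` in `y->adj`. A pure CRTP design would need the concrete type at the call site inside `grad`, which the tape no longer has.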
Cool. How did you run the optimization checker?
For gcc I add `-fopt-info-missed='missed.log'` to CXXFLAGS and it prints all the misses to a little file. After that I just grep'd for the Stan Math specific ones.
But yeah I need to reread your article from above and think about this more
Templates don't play nicely with OO, so I don't see how we could add a template parameter to the stack_alloc class. It can't go all the way to CRTP because something needs virtual dispatch in the reverse pass.
I'll think about this. I swear I have done this before, when I wanted to make our stack allocator compliant with the C++20 allocator concept.
Description
I ran gcc's missed-optimization reporter on the arK example in `stat_comp_benchmarks`. The gist below contains all the Stan Math related optimization misses:
https://gist.github.com/SteveBronder/eeb870bd25e9cb5c10d6e7bff44d24bb
In particular, optimization misses like the ones below were interesting to me.
Operators tend to be defined pretty similarly; below is the op for addition.
The main miss we have in the above is that `vari` stores its adj and val next to each other on the stack allocator when we really just need access to the adj values. This is a known issue, and Bob has a very nice post related to this on Discourse, but I wonder if there is a dumber approach which could resolve this particular issue.
What if we added a template parameter to `stack_alloc` (or maybe `vari`?) that defined the storage method we wanted? So say we defined enums `joint`, `val`, and `adj`, where `joint` places the `val_` and `adj_` together on the same stack and `val`/`adj` places the `val_` on one stack and `adj_` on another stack. Would that help for chain methods like in precomputed_vari? It feels like it would give us better memory locality. If we doc'd it well I don't think this style would be that bad, and we could just default the template value to `joint` so that we would get better memory locality and hit these optimizations for SIMD instructions.
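A rough sketch of what a layout-tagged store could look like, under some loud assumptions: `storage`, `arena`, and `chain_precomputed` are hypothetical names, not anything in Stan Math. The separate `val`/`adj` tags are collapsed into a single `split` tag for brevity, and the operands of the chain step are assumed to sit contiguously in the store, which the real precomputed-gradients code does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout tags (assumption: val/adj merged into one "split" tag).
enum class storage { joint, split };

// Defaulting to joint keeps the interleaved layout unless asked otherwise.
template <storage S = storage::joint>
class arena;

// joint: val and adj interleaved in one buffer, as the issue describes.
template <>
class arena<storage::joint> {
  struct slot {
    double val;
    double adj;
  };
  std::vector<slot> slots_;

 public:
  std::size_t push(double v) {
    slots_.push_back({v, 0.0});
    return slots_.size() - 1;
  }
  double& val(std::size_t i) { return slots_[i].val; }
  double& adj(std::size_t i) { return slots_[i].adj; }
};

// split: values on one buffer, adjoints on another.
template <>
class arena<storage::split> {
  std::vector<double> vals_;
  std::vector<double> adjs_;

 public:
  std::size_t push(double v) {
    vals_.push_back(v);
    adjs_.push_back(0.0);
    return vals_.size() - 1;
  }
  double& val(std::size_t i) { return vals_[i]; }
  double& adj(std::size_t i) { return adjs_[i]; }
};

// Chain step in the spirit of a precomputed-gradients node:
// operand_adj[i] += out_adj * partial[i].
// Assumes the operands occupy indices first .. first + partials.size() - 1.
template <storage S>
void chain_precomputed(arena<S>& mem, std::size_t first,
                       const std::vector<double>& partials, double out_adj) {
  for (std::size_t i = 0; i < partials.size(); ++i) {
    mem.adj(first + i) += out_adj * partials[i];
  }
}
```

Both layouts give the same adjoints. The difference is only in the access pattern: with `split` the loop walks the adjoint buffer at unit stride, while with `joint` it skips over every `val`. Unit stride is generally the easier case for a vectorizer, but whether gcc actually stops reporting the miss would need to be checked against the `-fopt-info-missed` output.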
Expected Output
better memory locality
Current Version:
v3.0.0