Open SteveBronder opened 1 year ago
I'd like to see some more benchmarks on this before merging. For the model @mike-lawrence was using on the forums, I was able to re-create the following numbers
Matrix mode | Time (approximate) |
---|---|
AoS |
3 seconds |
SoA (2.32.1 release) |
12 seconds |
SoA (this branch) |
8 seconds |
So for a very large model with a lot of multi-indexing, this seems to be a win over the current implementation but still quite far off from the default (AoS
) implementation.
There are really two questions:
flat_map
or our own small implementation of the same) would be faster (curious, but not essential)Yes I'm going to writeup a google benchmark for this vs AoS soon. I think the big Q for me is at whether multi index assignment should always be turned off in the compiler. Those are like 3x slowdowns so I think there would need to be a lot of other work being done before SoA leads to an improvement here
I think the big Q for me is at whether multi index assignment should always be turned off in the compiler. Those are like 3x slowdowns
Given that we have to support multi-index assignment in the language. Is the issue just getting it efficient for varmat relative to matvar? How are you measuring? Also given that the advantage of varmat is downstream operation memory locality, not the assignment itself, how are you measuring?
Given that we have to support multi-index assignment in the language. Is the issue just getting it efficient for varmat relative to matvar?
Yep!
How are you measuring? Also given that the advantage of varmat is downstream operation memory locality, not the assignment itself, how are you measuring?
These are Qs I don't have answers to rn but I'm thinking about. It's hard! My first thought is varying sizes of multi-index assignments followed by a large multiplication then call to normal_lpdf.
My first thought is varying sizes of multi-index assignments followed by a large multiplication then call to normal_lpdf.
Me, too. That sounds completely reasonable. Of course, based on size of the matrix, results will vary, so it'll be good to report for a range of sizes.
So I've built a little repo for generally doing benchmarks with Stan and Stan math
https://github.com/SteveBronder/stan-perf
In there I have benchmarks for testing the current impl, using unordered_map
, and a hash table with open addressing and linear probing.
I'm running a benchmark that does a little more work rn, but for just a plain assignment multi index is pretty slow for SoA. The current version (set) is horrific, unordered map is about twice as good, and the custom map about twice as good as unordered map.
I took just the unordered map and custom map and looked at their average performance relative to matrix. It seems like it's about a 50% slowdown even with the custom map, but imo I think as long as the user is doing just a few operations the new memory scheme should still be an improvement. I'm running some benchmarks rn to test that.
Okay so after updating the tests to also do a little extra work at the end (dot product and addition) I think we should add the custom hash table.
https://github.com/SteveBronder/stan-perf/tree/main/benchmarks/varmat_multi_assign
Here are the updated raw plots where custom map is much faster after only doing a little extra work after the assignment.
And here is the ratio of times which shows the custom map being about 30% faster than matrix<var>
.
@bob-carpenter are you okay with me adding a paired down version of the hashmap from here? I'm not sure what folder it would go in?
In terms of bigger picture, we want matrices to work in the case where users set every entry. In fact, I think that's a very common use case for something like GPs (and why I want to get comprehensions in ASAP). I'm actually OK having struct of arrays (I thought that was what we'd settled on?) be constant and not user settable at all. That is, if a container gets modified, use array of structs.
Are there example use cases where a Stan program only sets a few elements of a larger matrix? Fiddling with the diagonal of a matrix maybe, but that's still N ops for an N x N matrix.
@bob-carpenter are you okay with me adding a paired down version of the hashmap from here?
Totally OK adding a custom map as long as it has sufficient testing.
I'm not sure what folder it would go in?
We can create something like a new utility class folder if there isn't something like that already. I guess that'd be inside prim
since this won't do any explicit autodiff, right?
In terms of bigger picture, we want matrices to work in the case where users set every entry. In fact, I think that's a very common use case for something like GPs (and why I want to get comprehensions in ASAP).
Yes I agree when @WardBrian gets back I think we will talk about how to do that
I'm actually OK having struct of arrays (I thought that was what we'd settled on?) be constant and not user settable at all. That is, if a container gets modified, use array of structs.
I think it's pretty easy to add the hash table and we should be able to allow for all the subsettting besides single index in a loop
Are there example use cases where a Stan program only sets a few elements of a larger matrix? Fiddling with the diagonal of a matrix maybe, but that's still N ops for an N x N matrix.
I think sometimes people will do things with contiguous slices of a matrix, like saying operating over each column. The speed for that stuff should be fine and this change should make the rest fine as well.
@bob-carpenter are you okay with me adding a paired down version of the hashmap from here?
Totally OK adding a custom map as long as it has sufficient testing.
Alrighty yeah the hashmap I used comes from this repo (MIT) licensed. It doesn't have any tests but is a well known table so I may kill two birds with one stone and add tests here and send a PR upstream to that PR with the tests
We can create something like a new utility class folder if there isn't something like that already. I guess that'd be inside prim since this won't do any explicit autodiff, right?
I think I'd keep the folder in the model folder since I'd rather not have to go back and forth between repos and this is the only place we use it atm. If that changes then we can always put it in the math library
Submission Checklist
./runTests.py src/test/unit
make cpplint
Summary
When we first wrote the assignments for the new matrix type we didn't optimize very hard for
multi_index
assignments. The discourse thread here makes a case for using these so I decided to fix these up a bit.Instead of using a set to keep track of unique indices being assigned to we just use a map with the key being the index to assign to and the value being the linear index position to access the right hand side vector / matrix. I haven't benchmarked this but it should take this from O(N * ln(N)) to O(N). It also cuts out an N sized vector we used to originally store indices. It also cuts out an if statement we had to do in the loop which should be much nicer for the CPU
Intended Effect
Make multi index assignment more performant
How to Verify
All tests still pass for multi index assignment
Side Effects
None hopefully
Documentation
No doc changes
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Steve Bronder
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: