rust-gamedev / wg

Coordination repository of the Game Development Working Group

Consider creating a game math library benchmark for the working group #93

Open kettle11 opened 3 years ago

kettle11 commented 3 years ago

Given that the working group recently took ownership of an ECS benchmark, it seems appropriate to also have a game math library benchmark. Game math libraries are even more benchmarked and debated than ECS frameworks.

A benchmark from the working group provides a common point of reference everyone can contribute to on neutral ground. The goal is to provide useful information to help people make informed choices about the Rust ecosystem.

Benchmarks provided by the Working Group should aim to help people holistically evaluate libraries. Ideally such a benchmark would also include metrics for compile times and perhaps lines of code (as a rough measure of functionality and complexity).

@bitshifter, @sebcrozet, and @termhn have all created their own benchmarks, perhaps they have thoughts?

fu5ha commented 3 years ago

I'm currently working on updating benchmarks that @sebcrozet did in my own fork of mathbench to include new ultraviolet features and also to try to run a more holistic test suite. As @bitshifter had mentioned in the main mathbench-rs repo, there should be benchmarks both for "wide" and "scalar" types, as both are important for different cases, so I'm trying to include tests that cover both cases, as well as individual benchmarks for each op for both cases.


bitshifter commented 3 years ago

I'd be happy for the working group to take ownership of mathbench. I think it's been useful to the community but it's usually way down my list of things to work on when I have free time so it's a bit unloved.

It would be good to get the wide nalgebra/ultraviolet benches into the same repo with the scalar benches intact, as @termhn mentioned (see https://github.com/bitshifter/mathbench-rs/issues/21).

If the working group were to take ownership of the code, I think they would also need to take ownership of publishing the results and updating them periodically when existing libraries are updated or new libraries are added. I publish results to my github site https://bitshifter.github.io/mathbench/0.3.0/report/index.html, @sebcrozet and @termhn have published their own results to their own blogs/READMEs. I think it would be good if there was a central location for keeping these.

The other thing to do when publishing results is to update the summary in the README and document the hardware and OS used to generate them. I also make a tag when publishing the results, so it's easy to see what lib versions were used to generate them. I've consistently used the same hardware, an old laptop of mine. However, that laptop doesn't support AVX-512, so it couldn't run some of the wide benchmarks. It's probably not the end of the world if the hardware changed between publishing runs, but it would be better if it didn't.

They take a long time to run and you can't really use the machine for anything else while they are running, which is another reason I haven't really been updating them.

AlexEne commented 3 years ago

@bitshifter Can you add this information somewhere in the repo? A sort of CONTRIBUTING.md for maintainers where this is described, so it's not lost in this issue. I also have some questions about the hardware that should be used for these: do you run them on your machine or on some EC2 instances? (Sorry if this was already discussed in meetings, but I can never make it to a wg meeting.)

On a more meta level, should we wait for @termhn's proposed changes to land before moving it? Who from the WG has bandwidth to help with this? (Assign this to yourself and ping me for permissions if needed.)

bitshifter commented 3 years ago

Sure, I can document guidelines for publishing results.

Hardware-wise I generally run on my own laptop. I think it's useful to use the same hardware each time I update it. The downside is that the machine is 5 years old and doesn't have recent CPU features that some libraries want to take advantage of. I have not investigated a cloud solution; it sounds workable in theory, provided nothing else is using resources while mathbench is running and the hardware is known and consistent.

I don't know if @termhn intended to try to get these changes back into mathbench or to keep them as a fork? It's probably a bit of work to get those changes back into the main repo just because they were quite extensive.

On that note I recently updated mathbench to include ultraviolet. I was holding out until I'd added wide support but I still haven't found the time to do that and I had a PR to add another library so it seemed like I may as well add ultraviolet at the same time. The ultraviolet support is mostly based on @termhn 's fork (without 0.6-pre changes).

I would still like to add wide tests. I'd like to keep them separate, but have one of the scalar libs running the test for comparison. I was possibly going to take a slightly different approach to what @sebcrozet did in his fork: have a bench with, say, 100 elements in it and run it through the different width types, rather than having a bench for each type width, if that makes sense? I was thinking of producing separate scalar and wide summary tables. The current scalar summary table is getting pretty huge on its own.

@sebcrozet's fork also added a lot of benches for types that other libraries don't generally have, which is fine; my original intention for mathbench was a comparison of the lowest common denominator of math library features. In some sense there's no harm in adding "exotic" features, it's just that there won't be much to compare them against, so maybe they're not so useful in the "official" repo?

I think there is some sense in people forking mathbench and adding benches that make sense for their library, or compiler flags that makes sense for their library. I see no harm in that.

fu5ha commented 3 years ago

Oh nice... I'll probably try to "rebase" my work on top of your current mathbench then @bitshifter

fu5ha commented 3 years ago

I was possibly going to take a slightly different approach to what @sebcrozet did in his fork - which was to have a bench with say 100 elements in it and they run it through different width types, rather than having a bench for each type width, if that makes sense?

If I understand what you mean, that would mean that every type would do the same number of total iterations (and as such, wide types would be doing more total values processed, but the same number of ops)... if so, I'm not sure I really like that way, as I think it sorta obfuscates the higher throughput and makes it harder to reason about? Of course the current method isn't perfect, as it's assuming you are able to start and end in wide types for your algorithm, which isn't always true, but I think it's still a valid case to test (it's how I use ultraviolet in rayn) and it makes it easier to compare the throughputs like I said before.

I was thinking of producing separate scalar and wide summary tables. The current scalar summary table is getting pretty huge on its own.

Yeah makes sense to me.

bitshifter commented 3 years ago

If I understand what you mean, that would mean that every type would do the same number of total iterations (and as such, wide types would be doing more total values processed, but the same number of ops)... if so, I'm not sure I really like that way, as I think it sorta obfuscates the higher throughput and makes it harder to reason about? Of course the current method isn't perfect, as it's assuming you are able to start and end in wide types for your algorithm, which isn't always true, but I think it's still a valid case to test (it's how I use ultraviolet in rayn) and it makes it easier to compare the throughputs like I said before.

No, not the total number of iterations. I'm suggesting the same number of inputs is used for each type. Wider types would be doing fewer iterations because they process 4, 8 or 16 elements at a time. So with, say, 100 single-input Vec3s, glam would process 1 at a time, an f32x4 type would process 4 at a time, an f32x8 would process 8 at a time, and so on. That should make the throughput advantage of wide types clearer, I think?

What it doesn't show is the timing of a single function call for each wide type (like how long a single Vec3x4::dot takes), which is what most of the scalar benches are trying to measure (i.e. the scalar benches give a good idea of the cost of Vec3::dot, for example).

I feel like using the same input size would give a better picture of the throughput advantage of the wider types though. I could add both single-call and throughput benches. It's just more to write and takes longer to run.
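The fixed-input-size idea can be sketched roughly like this (a hypothetical illustration, not mathbench code: the 4-wide `Vec3x4` type is modeled with plain arrays rather than real SIMD, and all names are invented). The same 100 inputs go through the scalar path one at a time and the wide path four at a time, so the wide path makes a quarter as many calls:

```rust
// Scalar 3D vector and its dot product: one input per call.
#[derive(Clone, Copy)]
struct Vec3 { x: f32, y: f32, z: f32 }

fn dot(a: Vec3, b: Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

// A "Vec3x4": four Vec3s stored lane-wise, processed in one call.
struct Vec3x4 { x: [f32; 4], y: [f32; 4], z: [f32; 4] }

fn dot_x4(a: &Vec3x4, b: &Vec3x4) -> [f32; 4] {
    let mut out = [0.0; 4];
    for i in 0..4 {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    }
    out
}

fn main() {
    let n = 100; // same number of inputs for every type
    let inputs: Vec<Vec3> = (0..n)
        .map(|i| Vec3 { x: i as f32, y: 1.0, z: 2.0 })
        .collect();

    // Scalar path: 100 calls, one element each.
    let scalar_sum: f32 = inputs.iter().map(|v| dot(*v, *v)).sum();

    // Wide path: 25 calls, four elements each.
    let mut wide_sum = 0.0f32;
    for chunk in inputs.chunks(4) {
        let pack = |f: fn(&Vec3) -> f32| {
            let mut lane = [0.0; 4];
            for (i, v) in chunk.iter().enumerate() { lane[i] = f(v); }
            lane
        };
        let wide = Vec3x4 { x: pack(|v| v.x), y: pack(|v| v.y), z: pack(|v| v.z) };
        wide_sum += dot_x4(&wide, &wide).iter().sum::<f32>();
    }

    // Both paths cover the same inputs, so the results agree.
    assert_eq!(scalar_sum, wide_sum);
    println!("scalar = {scalar_sum}, wide = {wide_sum}");
}
```

Timing the two loops over the same input size would show throughput directly, while a separate single-call bench would still be needed to see the cost of one `dot_x4` in isolation.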

fu5ha commented 3 years ago

I don't see how that is different than the way @sebcrozet implemented it (though I could just not be understanding still of course 😅)

I agree with you though, afaict

bitshifter commented 3 years ago

It's probably no different; I'm not super familiar with his fork :) The main thing is I would keep the existing scalar benches and wouldn't have all of the scalar types run the wide benches, except for maybe one for comparison. Mostly because I think there's limited value in it for the scalar libs and it adds to the time it takes to run the benches.

fu5ha commented 3 years ago

One more thing... I currently have a couple of benches implemented with f64 and f32 versions of wide types, but at this point I'm not sure it's actually worth it, to be honest. Think I'm gonna rip that out and just keep it consistently f32 across the board.

Lokathor commented 3 years ago

Ralith will cry

fu5ha commented 3 years ago

Well, there's not gonna be benches for scalar f64 across the board anyway so 😅

as far as all the current benchmarks go, any perf trends that are true of f32s are basically true of f64s, just f64s are like 3x slower across the board or something

bitshifter commented 3 years ago

I don't have a problem with dropping f64. Potentially someone could add them back at a later date for people who want it. Hopefully the existing macros could be used for f64 benches since they should be largely the same as the f32 ones.
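Reusing one macro for both widths could look roughly like this (a hypothetical sketch, not the actual mathbench macros; `dot_bench` is an invented name). Because the scalar type is just a macro parameter, f64 benches could be regenerated later without duplicating the bench bodies:

```rust
// One macro body, instantiated for each scalar width.
macro_rules! dot_bench {
    ($name:ident, $t:ty) => {
        fn $name(a: [$t; 3], b: [$t; 3]) -> $t {
            a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
        }
    };
}

// Generating both widths is a one-line addition per type.
dot_bench!(dot_f32, f32);
dot_bench!(dot_f64, f64);

fn main() {
    assert_eq!(dot_f32([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]), 32.0_f32);
    assert_eq!(dot_f64([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]), 32.0_f64);
    println!("same bench body, two scalar widths");
}
```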

fu5ha commented 3 years ago

https://github.com/termhn/mathbench-rs/blob/wide/benches/eulerbench.rs Here's the approach I'm taking that I think I will just copy out to the other benches basically

bitshifter commented 3 years ago

Sounds good to me.

I was thinking of passing a stride to the euler_bench macro and having the macro handle the &(size / 8) bit, just to streamline things a bit more. Also, I think @sebcrozet's version used &((*size as f32 / 8.0).ceil()), which would make sure the input wouldn't be truncated if the size wasn't a multiple of the stride. Kind of verbose, which is another reason I think it would be good if it could be handled by macros.

Note that a lot of the existing benches don't take a size parameter so you'll need to make a version of them that can handle that. Fairly easy to do, just repetitious.
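The ceiling-division idea behind that expression can be sketched like this (hypothetical; `wide_iterations` is an invented helper, not mathbench code). Integer ceiling division avoids the float round-trip and gives the same result as the `(*size as f32 / 8.0).ceil()` form:

```rust
// Number of wide iterations needed to cover `size` inputs with lanes of
// `width` elements. Plain `size / width` truncates and would skip the
// final partial chunk when `size` is not a multiple of `width`.
fn wide_iterations(size: usize, width: usize) -> usize {
    (size + width - 1) / width
}

fn main() {
    assert_eq!(wide_iterations(100, 8), 13); // 100 / 8 would truncate to 12
    assert_eq!(wide_iterations(96, 8), 12);  // exact multiple: no change
    assert_eq!(wide_iterations(100, 4), 25);
    println!("ceiling division covers every input");
}
```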

fu5ha commented 3 years ago

opened bitshifter/mathbench-rs#24