Thanks! FWIW, we've tested it on ARM, but did most of the optimizations for x86 because we thought that might be most relevant for "production use". In fact, I undid some optimizations because they were faster on ARM, but slower on x86. A lot has to do with cache efficiency, which is hard to predict. Just reordering some instructions can have a significant effect.
Right! I did not expect it to outperform the other recorded benchmarks so dramatically, that was a pleasant surprise.
I am happy to continue updating these benchmarks as new versions are released, or I can close this PR and let it stand as a snapshot of historical performance.
Or, for a more realistic environment, I could run them on a variety of Google Cloud instance classes if you like.
The benchmarks we've added so far run on stand-alone computing nodes. I don't think laptop- or cloud-based benchmarks are very useful in the long run. In a virtual machine, we'll never know what has changed on the hardware or software side, or whether another user is sharing the same physical CPU. Some cloud providers even disable certain CPU features (e.g., AVX). It's all pretty random. And on a laptop or desktop, results will depend on what other processes are running, temperature, battery status, etc.
It's true that you end up with a lot more noise from multitenancy, but one way I've dealt with that in the past is to do many more benchmark runs and throw out the slow outliers. It can also make sense to track only the minimum observed time.
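To make that concrete, here's a minimal sketch of a min-of-N timing harness in C. The `workload()` function is a hypothetical stand-in for whatever integration is being measured, not anything from assist:

```c
/* Minimal min-of-N benchmark sketch. workload() is a hypothetical
 * stand-in for the code under test; only the timing pattern matters. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static void workload(void) {
    /* volatile prevents the compiler from optimizing the loop away */
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) x += (double)i * 1e-9;
}

int main(void) {
    const int runs = 50;   /* many repetitions to absorb multitenant noise */
    double best = 1e300;   /* minimum observed wall-clock time, in seconds */
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double dt = (double)(t1.tv_sec - t0.tv_sec)
                  + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (dt < best) best = dt;  /* slow outliers (scheduler, thermal) drop out */
    }
    printf("min of %d runs: %.6f s\n", runs, best);
    return 0;
}
```

The minimum is a reasonable statistic here because external interference only ever makes a run slower, so the fastest observation is the closest to the machine's true capability.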
My real agenda is that I have a few performance-minded changes that I would like to measure. In particular, I think there are some spots where we could do less work expanding SPICE kernel polynomials, because all we end up using is position data, not velocity or acceleration. In some casual profiling, those polynomial expansions seemed to dominate my workload. But it might be better to just improve ephem caching; it's hard to say without measurements.
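For context, here's a sketch of the kind of saving I mean, using the standard Chebyshev recurrences that SPK-style kernels are built on. This is not assist's actual ephemeris code, and the function names are hypothetical; the point is just that a position-only evaluation runs one recurrence per coordinate, while position-plus-velocity carries a second recurrence for the derivative, roughly doubling the work:

```c
/* Sketch only: hypothetical Chebyshev evaluators over SPK-style
 * coefficients coeffs[0..n-1] at normalized time t in [-1, 1].
 * Assumes n >= 1. Not assist's actual kernel code. */
#include <stddef.h>

/* Position only: T_{k+1} = 2t*T_k - T_{k-1}, with T_0 = 1, T_1 = t. */
double cheb_position(const double *coeffs, size_t n, double t) {
    double tkm1 = 1.0, tk = t;
    double pos = coeffs[0] + (n > 1 ? coeffs[1] * t : 0.0);
    for (size_t k = 2; k < n; k++) {
        double tkp1 = 2.0 * t * tk - tkm1;
        pos += coeffs[k] * tkp1;
        tkm1 = tk; tk = tkp1;
    }
    return pos;
}

/* Position and velocity: carries the derivative recurrence
 * T'_{k+1} = 2*T_k + 2t*T'_k - T'_{k-1}, with T'_0 = 0, T'_1 = 1.
 * A real SPK evaluation would also scale the velocity by the
 * record's time-interval factor; omitted here for brevity. */
void cheb_pos_vel(const double *coeffs, size_t n, double t,
                  double *pos, double *vel) {
    double tkm1 = 1.0, tk = t;
    double dkm1 = 0.0, dk = 1.0;
    *pos = coeffs[0] + (n > 1 ? coeffs[1] * t : 0.0);
    *vel = (n > 1 ? coeffs[1] : 0.0);
    for (size_t k = 2; k < n; k++) {
        double tkp1 = 2.0 * t * tk - tkm1;
        double dkp1 = 2.0 * tk + 2.0 * t * dk - dkm1;
        *pos += coeffs[k] * tkp1;
        *vel += coeffs[k] * dkp1;
        tkm1 = tk; tk = tkp1;
        dkm1 = dk; dk = dkp1;
    }
}
```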
So anyway, finding some baseline that I can run would be really useful for making performance-oriented changes. Do you have thoughts on a good way to approach that?
Cool. I'd say just test it on whichever machine you plan to use it on. Once you've found what seems to be a good optimization, send us a pull request and we'll run it on the machines we've been using so far.
(Ideally, we'd set up our own GitHub Actions runners on dedicated machines to provide a benchmark with every push, but that's overkill for a project this small.)
That sounds great to me. I'll close this.
This is an Apple MacBookPro18,2 with a 10-core M1 Max chip and 32 GB of memory. rebound and assist were compiled with clang-1400.0.29.202 for arm64-apple-darwin21.6.0.
The results are frighteningly fast. I think it's safe to say that this code works well on ARM!