renaissance-benchmarks / renaissance

The Renaissance Benchmark Suite
https://renaissance.dev
GNU General Public License v3.0

Philosophers: Inverse scalability "problem" #440

Open shipilev opened 1 month ago

shipilev commented 1 month ago

We have been studying the performance of Philosophers on large machines and realized that the number of CPUs on the machine determines the number of philosophers in the benchmark.

This means that machines with different numbers of CPUs run different workloads, which misleads cross-hardware comparisons. AFAICS, this is not what benchmarks usually do: in most benchmarks, higher available hardware parallelism performs the same global amount of work, either showing an improvement due to parallelism or a degradation due to contention. In Philosophers, adding hardware parallelism just makes the benchmark slower, because the global amount of work grows, on top of the usual contention effects.
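
To illustrate the mechanism, here is a minimal Java sketch of the general pattern; it is not the benchmark's actual code, just the kind of CPU-count-driven sizing described above:

// Illustration only: how a JVM-based benchmark typically derives its
// degree of parallelism from the machine it runs on.
public class ThreadCountSketch {
    public static void main(String[] args) {
        // availableProcessors() reflects the CPUs the JVM believes it can use;
        // it also honors -XX:ActiveProcessorCount=<n>, which is why overriding
        // that flag changes the workload size in a CPU-count-driven benchmark.
        int cpus = Runtime.getRuntime().availableProcessors();
        // Hypothetical sizing in the style described above: one philosopher
        // per reported CPU, so more CPUs mean more total work rather than the
        // same work spread across more threads.
        int philosophers = cpus;
        System.out.println("CPUs reported by the JVM: " + cpus);
        System.out.println("Philosophers (hypothetical sizing): " + philosophers);
    }
}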

An easy way to demonstrate this is to override -XX:ActiveProcessorCount=# on a large 64-core machine:

$ shipilev-jdk21u-dev/build/linux-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+AlwaysPreTouch -XX:+UnlockDiagnosticVMOptions -XX:ActiveProcessorCount=... -jar renaissance-jmh-0.15.0.jar Philosophers -f 5 -wi 5 -i 5

ActiveProcessorCount=1:    230.081 ± 12.516 ms/op
ActiveProcessorCount=2:   1570.336 ± 75.888 ms/op
ActiveProcessorCount=4:   1893.643 ± 85.768 ms/op
ActiveProcessorCount=8:   2466.867 ± 114.564 ms/op
ActiveProcessorCount=16:  3374.587 ± 182.243 ms/op
ActiveProcessorCount=32:  5097.616 ± 330.096 ms/op
ActiveProcessorCount=64: 10788.201 ± 1470.015 ms/op

(The benchmark also thrashes hard when all CPUs are busy, but I think that is just the way it works.)

I don't have a good solution for this, except maybe setting the number of philosophers to some fixed value.

lbulej commented 1 month ago

Thanks for bringing this up! I don't have a good solution either, partly because we may be using Renaissance differently: we compare data from systems with identical hardware configurations (including CPU count), and we are not looking at how repetition times change with the number of available processors, but at how they change due to changes in the JVM.

We have previously limited parallelism in most Spark benchmarks because we observed serious thrashing when running on 64-core machines, and we don't have a common workload scaling strategy for the suite (yet). While we consider that a drawback we would like to fix (many workloads won't scale beyond a certain number of cores and need more work to keep the extra cores busy doing sensible things), it may go against your expectation that the amount of work should stay the same regardless of the amount of available resources.

For us, the purpose of workload scaling is to reduce contention when more hardware resources are available, but I'm not sure that is how the scaling in the Philosophers benchmark works. Let's see if your data can provide any indication. Ignoring the contention-free 1-thread case in your data, I would expect the benchmark to slow down in completing work units (500000 successful acquisitions of both resources), so I simply divide the completion times by the number of threads (work units):

total time (ms)    threads (= work units)    time per unit (ms)
 1570                       2                       785
 1893                       4                       473
 2467                       8                       308
 3375                      16                       211
 5098                      32                       159
10788                      64                       169
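
For reference, the per-unit column is just the repetition time divided by the thread count; a throwaway sketch of that arithmetic, using the mean values from the measurements above:

// Throwaway sketch: divide each measured repetition time by the number
// of threads (= work units) to get the per-unit completion time.
public class PerUnitTime {
    public static void main(String[] args) {
        // Mean repetition times (ms/op) from the measurements above and the
        // corresponding thread counts (= work units).
        int[] threads = {2, 4, 8, 16, 32, 64};
        double[] totalMs = {1570, 1893, 2467, 3375, 5098, 10788};
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%5.0f ms / %2d units = %3.0f ms per unit%n",
                    totalMs[i], threads[i], totalMs[i] / threads[i]);
        }
    }
}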

The time to complete a work unit keeps decreasing until the number of threads hits 64, where the increased workload probably increases contention (on average). There is probably enough wobble in the timings that this case may still be considered to scale (not great, but still). Would you agree with that interpretation, or did I make some gross oversimplification?

That said, it may be a good idea to fix or bound the number of philosophers, but it may take us some time to come up with an actual number to put in a release. Normally, I would suggest overriding the thread_count parameter on the command line (e.g., --override thread_count=16) to get a fixed number of philosophers regardless of the hardware configuration, but that is not supported with the JMH harness. Would it help to have at least some way (e.g., JVM system properties) to override benchmark configuration parameters?
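
For example, such an override could look roughly like this (the property name is made up to illustrate the idea; it is not something the suite currently reads):

// Hypothetical sketch of a system-property override for the philosopher count.
public class OverrideSketch {
    public static void main(String[] args) {
        // Default: today's behavior, one philosopher per reported CPU.
        int defaultCount = Runtime.getRuntime().availableProcessors();
        // The property name is invented for illustration and is not part of
        // the current Renaissance configuration. Run with e.g.
        // -Dphilosophers.thread_count=16 to cap the philosopher count.
        int philosophers = Integer.getInteger("philosophers.thread_count", defaultCount);
        System.out.println("Philosophers: " + philosophers);
    }
}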

Alternatively, we could fix the number of threads in the jmh configuration of the benchmark, which is used by the JMH harness. The jmh configuration is currently the same as the default configuration, mainly because so far nobody has asked for different settings for this benchmark when run using the JMH harness.