mlange-42 / arche

Arche is an archetype-based Entity Component System (ECS) for Go.
https://mlange-42.github.io/arche/
MIT License

question on benchmarks for array_of_struct #414

Open aktau opened 5 months ago

aktau commented 5 months ago

The graph in the README, showing Arche outperforming a bog-standard AoS setup, really surprised me. Memory locality in the AoS case should be very good. To verify, I ran the benchmarks and found results more in line with my expectations:

$ benchstat -col '/impl' ~/tmp/bench.arche
goarch: amd64
pkg: github.com/mlange-42/arche/benchmark/competition/array_of_structs
cpu: AMD Ryzen Threadripper PRO 3995WX 64-Cores
                         │     Arche     │            ArrOfPointers             │            ArrOfStructs             │              LinkedList               │
                         │    sec/op     │    sec/op      vs base               │    sec/op     vs base               │    sec/op      vs base                │
Fill/conf=16B_1_000-8      3301.0n ± ∞ ¹    503.0n ± ∞ ¹  -84.76% (p=0.008 n=5)   510.9n ± ∞ ¹  -84.52% (p=0.008 n=5)   1252.0n ± ∞ ¹   -62.07% (p=0.008 n=5)
Fill/conf=16B_10_000-8     32.415µ ± ∞ ¹   18.993µ ± ∞ ¹  -41.41% (p=0.008 n=5)   5.824µ ± ∞ ¹  -82.03% (p=0.008 n=5)   12.868µ ± ∞ ¹   -60.30% (p=0.008 n=5)
Fill/conf=16B_100_000-8    322.92µ ± ∞ ¹    55.82µ ± ∞ ¹  -82.71% (p=0.008 n=5)   58.83µ ± ∞ ¹  -81.78% (p=0.008 n=5)   130.03µ ± ∞ ¹   -59.73% (p=0.008 n=5)
Fill/conf=32B_1_000-8      3360.0n ± ∞ ¹    661.1n ± ∞ ¹  -80.32% (p=0.008 n=5)   557.0n ± ∞ ¹  -83.42% (p=0.008 n=5)   1365.0n ± ∞ ¹   -59.38% (p=0.008 n=5)
Fill/conf=32B_10_000-8     32.258µ ± ∞ ¹    6.797µ ± ∞ ¹  -78.93% (p=0.008 n=5)   6.585µ ± ∞ ¹  -79.59% (p=0.008 n=5)   14.178µ ± ∞ ¹   -56.05% (p=0.008 n=5)
Fill/conf=32B_100_000-8    319.83µ ± ∞ ¹    88.52µ ± ∞ ¹  -72.32% (p=0.008 n=5)   67.72µ ± ∞ ¹  -78.83% (p=0.008 n=5)   164.23µ ± ∞ ¹   -48.65% (p=0.008 n=5)
Fill/conf=64B_1_000-8      3407.0n ± ∞ ¹    877.6n ± ∞ ¹  -74.24% (p=0.008 n=5)   692.7n ± ∞ ¹  -79.67% (p=0.008 n=5)   1779.0n ± ∞ ¹   -47.78% (p=0.008 n=5)
Fill/conf=64B_10_000-8     32.483µ ± ∞ ¹    9.963µ ± ∞ ¹  -69.33% (p=0.008 n=5)   7.493µ ± ∞ ¹  -76.93% (p=0.008 n=5)   19.404µ ± ∞ ¹   -40.26% (p=0.008 n=5)
Fill/conf=64B_100_000-8    322.95µ ± ∞ ¹   106.15µ ± ∞ ¹  -67.13% (p=0.008 n=5)   84.89µ ± ∞ ¹  -73.72% (p=0.008 n=5)   204.36µ ± ∞ ¹   -36.72% (p=0.008 n=5)
Fill/conf=128B_1_000-8     3552.0n ± ∞ ¹    864.6n ± ∞ ¹  -75.66% (p=0.008 n=5)   667.0n ± ∞ ¹  -81.22% (p=0.008 n=5)   1818.0n ± ∞ ¹   -48.82% (p=0.008 n=5)
Fill/conf=128B_10_000-8     32.32µ ± ∞ ¹    11.96µ ± ∞ ¹  -63.00% (p=0.008 n=5)   10.64µ ± ∞ ¹  -67.07% (p=0.008 n=5)    26.00µ ± ∞ ¹   -19.55% (p=0.008 n=5)
Fill/conf=128B_100_000-8    319.7µ ± ∞ ¹    251.0µ ± ∞ ¹  -21.50% (p=0.008 n=5)   176.5µ ± ∞ ¹  -44.79% (p=0.008 n=5)    528.7µ ± ∞ ¹   +65.38% (p=0.008 n=5)
Fill/conf=256B_1_000-8     3459.0n ± ∞ ¹    892.2n ± ∞ ¹  -74.21% (p=0.008 n=5)   667.8n ± ∞ ¹  -80.69% (p=0.008 n=5)   1951.0n ± ∞ ¹   -43.60% (p=0.008 n=5)
Fill/conf=256B_10_000-8    32.567µ ± ∞ ¹   11.829µ ± ∞ ¹  -63.68% (p=0.008 n=5)   8.452µ ± ∞ ¹  -74.05% (p=0.008 n=5)   32.826µ ± ∞ ¹         ~ (p=0.151 n=5)
Fill/conf=256B_100_000-8    323.4µ ± ∞ ¹    610.1µ ± ∞ ¹  +88.64% (p=0.008 n=5)   574.7µ ± ∞ ¹  +77.68% (p=0.008 n=5)   1334.7µ ± ∞ ¹  +312.66% (p=0.008 n=5)
geomean                     32.90µ          10.83µ        -67.09%                 8.435µ        -74.36%                  21.55µ         -34.50%

Of note, I manually changed the benchmark names with a text editor so I could use benchstat's column projection.
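
For illustration, the renaming looked roughly like this; the original names on the left are approximate, only the structure matters:

BenchmarkArrOfStructs/conf=16B_1_000-8  ->  BenchmarkFill/impl=ArrOfStructs/conf=16B_1_000-8
BenchmarkArche/conf=16B_1_000-8         ->  BenchmarkFill/impl=Arche/conf=16B_1_000-8

With an /impl=... key in the sub-name, `-col '/impl'` tells benchstat to split the implementations into columns and to group the remaining name (Fill/conf=...) into rows, as in the table above.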

What did I do wrong?

mlange-42 commented 5 months ago

Hi @aktau, many thanks for your interest!

Honestly, I don't know what you "did wrong", as I don't know your code. Can you reproduce the original benchmarks? Maybe we should turn the question around: what do you think I did wrong?

Regarding the cache locality of AoS, I think it is not that clear-cut. The benchmarks are actually intended to show that the cache-friendliness of AoS decreases with increasing entity size while always accessing the same small amount of data. With AoS, the entire entity data needs to be loaded into the cache, no matter how much of it is actually used. With large entities, fewer fit into the cache, and more frequent access to slower cache levels is required. So I think the entity-size dimension of the problem is quite clear.
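
To make that concrete, here is a minimal Go sketch of the access pattern (my illustration, not the actual benchmark code; the struct names and the 240-byte payload are made up):

package main

// Position is the small piece of data the hot loop actually needs.
type Position struct{ X, Y float64 }

// entityAoS mimics one of the larger benchmark configurations: Payload stands
// in for component data that the loop never reads.
type entityAoS struct {
	Pos     Position
	Payload [240]byte
}

func fillAoS(entities []entityAoS) {
	for i := range entities {
		// Only Pos is read and written; Payload just spreads the entities
		// apart in memory, so each loaded cache line carries few useful bytes.
		entities[i].Pos.X += 1
		entities[i].Pos.Y += 1
	}
}

func main() {
	fillAoS(make([]entityAoS, 100_000))
}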

What is less clear to me is why there is no saturation effect with an increasing number of entities. Well, maybe there is, but only for even larger numbers.

I also did the same benchmarks for a Rust ECS (rs-ecs), and qualitatively the results are the same.

Could you share your benchmarking code for a comparison?

aktau commented 5 months ago

Honestly, I don't know what you "did wrong", as I don't know your code. Can you reproduce the original benchmarks?

I'm just running your code, unedited:

$ cd github/arche/benchmark/competition/array_of_structs
$ go test -test.bench=. -test.count=5 | tee ~/tmp/bench.arche
$ nvim ~/tmp/bench.arche # edit the benchmark names so that I can use the `-col '/impl'` flag, as stated above
$ benchstat -col '/impl' ~/tmp/bench.arche

Maybe we should turn the question around: what do you think I did wrong?

I don't think you did anything wrong. I was just surprised by the result.

Regarding the cache locality of AoS, I think it is not that clear-cut. The benchmarks are actually intended to show that the cache-friendliness of AoS decreases with increasing entity size while always accessing the same small amount of data. With AoS, the entire entity data needs to be loaded into the cache, no matter how much of it is actually used. With large entities, fewer fit into the cache, and more frequent access to slower cache levels is required. So I think the entity-size dimension of the problem is quite clear.

So if I understand it right, Arche uses an SoA pattern, and in the loops only a small part of the struct is accessed, which is why it can pull ahead. Is that correct?

mlange-42 commented 5 months ago

I'm just running your code, unedited:

Can you verify that the original output already shows what the benchstat table does?

So if I understand it right, Arche uses an SoA pattern, and in the loops only a small part of the struct is accessed, which is why it can pull ahead. Is that correct?

Yes, kind of. Most ECS implementations (archetype-based like Arche, but also sparse-set-based ones) use a memory layout more similar to SoA. However, the arrays (or buffers) do not necessarily contain only primitive types, but whole components. A component can contain multiple variables/primitives, but for good performance they should be closely related and mostly accessed together.
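
As a rough sketch of that layout in Go (my own illustration, not Arche's actual internals; the component names are made up):

package main

type Position struct{ X, Y float64 }
type Velocity struct{ X, Y float64 }
type BigData struct{ Payload [240]byte } // a large, rarely-read component

// archetype keeps one index-aligned slice per component type, so entity i's
// components live at index i of each slice.
type archetype struct {
	positions  []Position
	velocities []Velocity
	bigData    []BigData
}

func (a *archetype) move() {
	// A system that only needs Position and Velocity streams just those two
	// slices through the cache; the BigData payloads are never touched.
	for i := range a.positions {
		a.positions[i].X += a.velocities[i].X
		a.positions[i].Y += a.velocities[i].Y
	}
}

func main() {
	const n = 100_000
	a := &archetype{
		positions:  make([]Position, n),
		velocities: make([]Velocity, n),
		bigData:    make([]BigData, n),
	}
	a.move()
}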

You may want to take a look at the architecture section of the user manual for more details.

aktau commented 5 months ago

Can you verify that the original output already shows what the benchstat table does?

It does. I did not change the numbers; I only edited the output format so that benchstat can render/compare them better (see the `-col` flag).

So if I understand it right, Arche uses an SoA pattern, and in the loops only a small part of the struct is accessed, which is why it can pull ahead. Is that correct?

Yes, kind of

I think that can explain the difference in performance. For very big structs, the SoA-like approach of Arche would be superior, even taking the extra abstraction into account.
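
As a rough back-of-envelope check (assuming 64-byte cache lines and that the hot loop touches something like a 16-byte position): with 256-byte entities, each cache line the AoS loop pulls in carries at most 16 useful bytes, i.e. 25% utilization, while a densely packed per-component array uses essentially 100% of every line it loads. That would explain why the gap widens with entity size, as in the table above.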

mlange-42 commented 5 months ago

I would still like to find out why your results differ that much. Could you share the raw results obtained on your machine?