This PR adds support for padding to avoid false sharing.
OCaml Stdlib recently got added a way allocate cache-aligned atomics. To mirror the development, the Loc module is extended with a make_contended function. However, the approach in this PR is to add an optional boolean argument padded to relevant constructor functions, i.e. Loc.make ~padded:true.
Why not name the argument contended? False sharing is a specific kind of contention. However, you can have contention even when you don't have false sharing simply because the same location is being accessed by multiple domains / cores. Using make_contended doesn't help with that at all and the term contended is potentially misleading.
This PR also makes data structures use padded locations and also pad other parts where it should be advantageous using multicore-magic.
For background, false sharing has absolutely nothing to do with atomics per se. False sharing happens whenever there is at least one core writing and other cores accessing (reading or writing) something that happens to map to the same cache line (aligned region of memory). It doesn't have to involve atomics at all. The word being written to could just be a regular mutable record field (to which no other core has access to) and the other cores might just be reading fields of immutable records that happen to be in the same cache line. This is something that happens very easily and even commonly. I've had to work around these issues regularly while trying to optimize and benchmark multicore OCaml code.
False sharing can signficantly degrade performance unless carefully avoided. While developing this PR, I saw a major performance improvement with the Hashtbl after adding padding:
Benchmark 1: kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10
Time (mean ± σ): 123.3 ms ± 2.8 ms [User: 240.5 ms, System: 1.9 ms]
Range (min … max): 122.4 ms … 136.3 ms 24 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10
Time (mean ± σ): 94.1 ms ± 0.2 ms [User: 183.0 ms, System: 1.6 ms]
Range (min … max): 93.9 ms … 94.6 ms 31 runs
Summary
kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10 ran
1.31 ± 0.03 times faster than kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10
Benchmark 1: kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10
Time (mean ± σ): 68.7 ms ± 0.3 ms [User: 260.3 ms, System: 2.6 ms]
Range (min … max): 67.9 ms … 69.6 ms 43 runs
Benchmark 2: kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10
Time (mean ± σ): 53.3 ms ± 0.4 ms [User: 198.6 ms, System: 2.6 ms]
Range (min … max): 52.8 ms … 55.2 ms 55 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10 ran
1.29 ± 0.01 times faster than kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10
This PR adds support for padding to avoid false sharing.
OCaml Stdlib recently got added a way allocate cache-aligned atomics. To mirror the development, the
Loc
module is extended with amake_contended
function. However, the approach in this PR is to add an optional boolean argumentpadded
to relevant constructor functions, i.e.Loc.make ~padded:true
.Why not name the argument
contended
? False sharing is a specific kind of contention. However, you can have contention even when you don't have false sharing simply because the same location is being accessed by multiple domains / cores. Usingmake_contended
doesn't help with that at all and the termcontended
is potentially misleading.This PR also makes data structures use padded locations and also pad other parts where it should be advantageous using multicore-magic.
For background, false sharing has absolutely nothing to do with atomics per se. False sharing happens whenever there is at least one core writing and other cores accessing (reading or writing) something that happens to map to the same cache line (aligned region of memory). It doesn't have to involve atomics at all. The word being written to could just be a regular mutable record field (to which no other core has access to) and the other cores might just be reading fields of immutable records that happen to be in the same cache line. This is something that happens very easily and even commonly. I've had to work around these issues regularly while trying to optimize and benchmark multicore OCaml code.
False sharing can signficantly degrade performance unless carefully avoided. While developing this PR, I saw a major performance improvement with the
Hashtbl
after adding padding: