ocaml-multicore / kcas

Software Transactional Memory for OCaml
https://ocaml-multicore.github.io/kcas/
ISC License
109 stars 11 forks source link

Support padding to avoid false sharing #116

Closed polytypic closed 1 year ago

polytypic commented 1 year ago

This PR adds support for padding to avoid false sharing.

OCaml Stdlib recently got added a way allocate cache-aligned atomics. To mirror the development, the Loc module is extended with a make_contended function. However, the approach in this PR is to add an optional boolean argument padded to relevant constructor functions, i.e. Loc.make ~padded:true.

Why not name the argument contended? False sharing is a specific kind of contention. However, you can have contention even when you don't have false sharing simply because the same location is being accessed by multiple domains / cores. Using make_contended doesn't help with that at all and the term contended is potentially misleading.

This PR also makes data structures use padded locations and also pad other parts where it should be advantageous using multicore-magic.

For background, false sharing has absolutely nothing to do with atomics per se. False sharing happens whenever there is at least one core writing and other cores accessing (reading or writing) something that happens to map to the same cache line (aligned region of memory). It doesn't have to involve atomics at all. The word being written to could just be a regular mutable record field (to which no other core has access to) and the other cores might just be reading fields of immutable records that happen to be in the same cache line. This is something that happens very easily and even commonly. I've had to work around these issues regularly while trying to optimize and benchmark multicore OCaml code.

False sharing can signficantly degrade performance unless carefully avoided. While developing this PR, I saw a major performance improvement with the Hashtbl after adding padding:

Benchmark 1: kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10
  Time (mean ± σ):     123.3 ms ±   2.8 ms    [User: 240.5 ms, System: 1.9 ms]
  Range (min … max):   122.4 ms … 136.3 ms    24 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10
  Time (mean ± σ):      94.1 ms ±   0.2 ms    [User: 183.0 ms, System: 1.6 ms]
  Range (min … max):    93.9 ms …  94.6 ms    31 runs

Summary
  kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10 ran
    1.31 ± 0.03 times faster than kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 2 1_000_000 1000 10

Benchmark 1: kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10
  Time (mean ± σ):      68.7 ms ±   0.3 ms    [User: 260.3 ms, System: 2.6 ms]
  Range (min … max):    67.9 ms …  69.6 ms    43 runs

Benchmark 2: kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10
  Time (mean ± σ):      53.3 ms ±   0.4 ms    [User: 198.6 ms, System: 2.6 ms]
  Range (min … max):    52.8 ms …  55.2 ms    55 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  kcas-this/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10 ran
    1.29 ± 0.01 times faster than kcas-main/_build/default/test/kcas_data/hashtbl_bench.exe 4 1_000_000 1000 10