qinsoon opened this issue 1 year ago
As we use `portable_atomic` for 128-bit atomics, they list the support for different platforms: https://github.com/taiki-e/portable-atomic/blob/HEAD/src/imp/atomic128/README.md

- `LDXP`/`STXP` loop: from ARMv8 (2011)
- `CASP` (DWCAS): added as FEAT_LSE (mandatory from armv8.1-a) (2014)
- `LDP`/`STP` (DW load/store): if FEAT_LSE2 (optional from armv8.2-a, mandatory from armv8.4-a) (2017)
- `LDIAPP`/`STILP` (DW acquire-load/release-store): added as FEAT_LRCPC3 (optional from armv8.9-a/armv9.4-a)
- `LDCLRP`/`LDSETP`/`SWPP` (DW RMW): added as FEAT_LSE128 (optional from armv9.4-a)

For x86_64 and aarch64, we should be fine. On platforms that use the fallback implementation, we will have a performance issue.
I confirmed using a cross-compilation toolchain that on aarch64-unknown-linux-gnu, `AtomicU128` uses an LL/SC loop (`LDXP`, `STXP` and `CBNZ`) to implement both atomic load and atomic store. (For a load, it stores the same value back and checks whether the store succeeds; for a store, it stores the new value.) This is bad because it effectively turns every load into a load-store pair.
On riscv64gc-unknown-linux-gnu, AtomicU128 is not lock-free.
If we would like to provide an efficient implementation for platforms that do not support 128-bit atomics, we should consider not using `&dyn SFT`. We need a way to represent an SFT entry as a normal pointer.
I think there are many possibilities.
For the generous heap layout where each space occupies a dedicated 2TB region, we can eagerly populate the `SFTSpaceMap`. From then on, we only need read-only access for SFT entries.
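A minimal sketch of that read-only lookup, assuming each space occupies a dedicated, aligned 2 TB (2^41-byte) region (the constant names and the use of space names in place of `&dyn SFT` entries are made up for illustration):

```rust
// Hypothetical: with 2 TB-aligned spaces, the space index is just the
// high bits of an address, and the table can be a plain array that is
// eagerly populated at startup and only read afterwards -- no atomics
// needed on the lookup path.
const LOG_BYTES_IN_SPACE: usize = 41; // 2 TB per space (assumption)
const MAX_SPACES: usize = 32; // hypothetical limit

fn space_index(addr: usize) -> usize {
    addr >> LOG_BYTES_IN_SPACE
}

fn main() {
    // Eagerly populated once; afterwards read-only.
    let mut table: Vec<&'static str> = vec!["empty"; MAX_SPACES];
    table[1] = "some-space"; // hypothetical space at [2 TB, 4 TB)

    let addr = (1usize << LOG_BYTES_IN_SPACE) + 0x1234;
    println!("{}", table[space_index(addr)]);
}
```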
For `SFTDenseChunkMap`, only the first-level map (from chunk address to space index) needs to be atomic. The second level is a map from space index to `&dyn SFT`; it can be eagerly populated and only needs read-only access. Because we cannot have too many spaces, the first level only needs one byte per chunk, and there is no problem making that atomic.
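The two-level scheme could look roughly like the following sketch (type names, the chunk count, and the use of space names in place of `&dyn SFT` are assumptions for illustration):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const CHUNKS: usize = 8; // hypothetical chunk count
const NO_SPACE: u8 = u8::MAX;

struct DenseChunkMap {
    // First level: one atomic byte per chunk holding a space index.
    // Only this level is ever mutated concurrently.
    index_of_chunk: [AtomicU8; CHUNKS],
    // Second level: space index -> SFT entry (a space name stands in
    // for `&dyn SFT` here); eagerly populated, then read-only.
    sft_of_space: Vec<&'static str>,
}

impl DenseChunkMap {
    fn new(sft_of_space: Vec<&'static str>) -> Self {
        DenseChunkMap {
            index_of_chunk: std::array::from_fn(|_| AtomicU8::new(NO_SPACE)),
            sft_of_space,
        }
    }

    fn set_chunk(&self, chunk: usize, space: u8) {
        self.index_of_chunk[chunk].store(space, Ordering::Release);
    }

    fn sft_for_chunk(&self, chunk: usize) -> Option<&'static str> {
        match self.index_of_chunk[chunk].load(Ordering::Acquire) {
            NO_SPACE => None,
            i => Some(self.sft_of_space[i as usize]),
        }
    }
}
```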
For `SFTSparseChunkMap`, each chunk maps directly to a `&dyn SFT`. But if every sft_map entry can only change from NULL to a non-NULL `&dyn SFT`, we can split a `&dyn SFT` into two words and load each of them atomically (that is, two 64-bit atomic loads, but not one 128-bit atomic load). If either of the two words is NULL, we repeat loading until both are non-NULL. We only call methods of the `&dyn SFT` after we have obtained both non-NULL words, combined them, and transmuted the combined 128-bit value into a `&dyn SFT`.
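The split-word idea could be sketched as below. This is only a sketch: it transmutes between `&dyn SFT` and two words, which relies on the de-facto (but not language-guaranteed) layout of fat pointers, and the trait and type names are made up:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

trait Sft {
    fn name(&self) -> &'static str;
}

struct SomeSpace; // hypothetical space type
impl Sft for SomeSpace {
    fn name(&self) -> &'static str { "some-space" }
}

// One sparse-chunk-map entry: the two halves of a `&dyn Sft` fat
// pointer (data pointer and vtable pointer), each in its own word.
// 0 means "not yet set"; entries only go from NULL to non-NULL.
struct SftEntry {
    lo: AtomicUsize,
    hi: AtomicUsize,
}

impl SftEntry {
    const fn new() -> Self {
        SftEntry { lo: AtomicUsize::new(0), hi: AtomicUsize::new(0) }
    }

    // Store: split the fat pointer into two words, store each atomically.
    fn set(&self, sft: &'static dyn Sft) {
        // SAFETY: assumes a fat pointer is exactly two words; a real
        // implementation should prefer a stabilized raw-parts API.
        let [lo, hi]: [usize; 2] = unsafe { std::mem::transmute(sft) };
        self.lo.store(lo, Ordering::Release);
        self.hi.store(hi, Ordering::Release);
    }

    // Load: two 64-bit atomic loads instead of one 128-bit load.
    // Retry while either word is still NULL (spins until `set` is called).
    fn get(&self) -> &'static dyn Sft {
        loop {
            let lo = self.lo.load(Ordering::Acquire);
            let hi = self.hi.load(Ordering::Acquire);
            if lo != 0 && hi != 0 {
                // SAFETY: both words came from a previously stored
                // `&'static dyn Sft`, so recombining them is sound
                // under the same layout assumption as in `set`.
                return unsafe { std::mem::transmute([lo, hi]) };
            }
            std::hint::spin_loop();
        }
    }
}
```

Both halves of a valid `&dyn Sft` are non-zero (the reference is non-null and the vtable pointer is non-null), so 0 is a usable "unset" sentinel for each word.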
Cross-compiling without `target-feature` probably results in the most conservative code. But that's fair: if we produce a binary, we won't assume a specific target feature. I am a bit surprised that `portable_atomic` has runtime feature detection, but did not implement atomic load using the detection on aarch64 Linux (they use it for CAS). Maybe LDXP/STXP is cheaper than detection + LSE instructions?
You are right that we can probably refactor so that we do not really need atomic access to the SFT for `SFTSpaceMap` and `SFTDenseChunkMap`. For `SFTSparseChunkMap`, however, we are not changing from NULL to non-NULL; we mutate between an empty SFT (`EMPTY_SPACE_SFT`) and a valid SFT. We cannot tell whether an entry is for an empty space until we dereference it.
We currently store SFT entries as `&dyn SFT`, which is a fat pointer. This is only efficient when the platform supports atomic instructions for double word size, i.e. 128-bit atomics on 64-bit machines, and 64-bit atomics on 32-bit machines. We will give a warning in a debug build if the platform does not support double-word atomics, in which case a fallback locking mechanism is used to ensure atomic access.