Effective lock implementation for virtual threads on NUMA architecture in Java, which is called NUMA_MCS Also my bachelor's diploma at ITMO university. Also my research project.
Researching, implementing and benchmarking NUMA aware locks for light (virtual) threads in Java. The target platform is Kunpeng-920 which has ARM-architecture.
This work could not be done without help of @Aksenov239, @anton-malakhov, @AndreyChurbanov.
https://github.com/ricnorr/jdk19/pull/3
Run ./scripts/install_java.sh
Run ./scripts/activate-conda.sh
from project's root to install miniconda
Run ./scripts/create-conda-envs.sh
from project's root to install python libraries
Run ./scripts/build_libs.sh
from project's root to compile native library
Run ./scripts/run_bench.sh
from the project's root
Results are saved in results/benchmark_results.csv
Pictures saved in results/pictures
Pdf saved in results/result.pdf
Benchmarks with label “48 cores” in title are measured on a system with HiSilicon Kunpeng-920, 48 cores, 2 NUMA nodes. Benchmarks with label “128 cores” in title are measured on a system with HiSillocon Kunpeng-920, 128 cores, 2 NUMA nodes.
Work is shared between threads, each thread in cycle does a work: multiply matrices (square matrices of size “out of c.s”), yields, acquires lock, multiple matrices (square matrix of size “in c.s”, which means in critical section), yields, releases lock. The throughput is a ratio of iterations number sum and total time.
NUMA_MCS consists of many MCS queues and boolean flag. Vthread tries to execute a CAS operation on a flag. If the operation is successful, the vthread becomes a lock owner, otherwise “slow path” happens. Slow path of lock acquiring consists of receiving NUMA id and joining the appropriate MCS queues. When the vthread is a leader of the local MCS queue, it waits for successful CAS on the flag. Scheme of lock acquiring is presented on the picture 1.
Picture 1. NUMA_MCS lock acquisition. VT1 from numa2 executes unsuccessful CAS and joins the local queue.
When the vthread releases the lock, it sets the flag to false and unpark the next vthread in the local queue.
The main problem of implementing effective locks for virtual thread on NUMA is the effect, when vthread is yielded inside critical section and resumed on another NUMA. Virtual threads migrate between NUMA nodes more often than platform threads. So NUMA_MCS expects that sometimes we are lucky and vthread enters and leaves critical section on the same NUMA. In the picture 2 presented benchmarks, comparing lock with local queue for each NUMA (NUMA_MCS) and NUMA_MCS with just one queue (NUMA_MCS_ONE_QUEUE).
Picture 2. Multiplication of matrices 50x50 inside the critical section, and doing nothing outside of the critical section. Very high contention.
To get NUMA id you can perform syscall getcpu and cache the information. I suggest to cache the value on the carrier thread side and sometimes update information.
This annotation is used to split data to different cache-lines and is a preferred way to solve the false-sharing problem. The cache line size can be set when initialising a VM.
Below presented benchmarks comparing NUMA_MCS and ReentrantLock from the Java standard library. Results show that NUMA_MCS is more effective than ReentrantLock.