A tricky issue in message-passing based work-requesting/work-stealing is how to efficiently select victims, especially with a large number of cores.
The initial intuition is to use a bitset, but an uint64 can only address up to 64 cores.
What if we have a 8-socket Xeon Platinum 9282 with 56 cores - 112 threads each\
hence a total of 896 threads?\
A 896-bit bitset would be huge, and we don't really want to have this be a compile-time parameter.
The current solution in Weave is unsatisfactory, we send to the victims an address to our own set of tried/untried victims. This works with shared-memory but not in a distributed computing setting.
The following recurrence generates all random numbers mod m without repetition:
Xₙ₊₁ = aXₙ+c (mod m)
if and only if (Hull-Dobell theorem):
c and m are coprime
a-1 is divisible by all prime factors of m
a-1 is divisible by 4 if m is divisible by 4
The only state to send in steal request would be a and c. Xₙ is the victim ID and m is either the number of threads or for ease of implementation the next power of 2.
iterator pseudoRandomPermutation(randomSeed: uint32, maxExclusive: int32): int32 =
## Create (low-quality) pseudo-random permutations for [0, max)
# Design considerations and randomness constraint for work-stealing, see docs/random_permutations.md
#
# Linear Congruential Generator: https://en.wikipedia.org/wiki/Linear_congruential_generator
#
# Xₙ₊₁ = aXₙ+c (mod m) generates all random number mod m without repetition
# if and only if (Hull-Dobell theorem):
# 1. c and m are coprime
# 2. a-1 is divisible by all prime factors of m
# 3. a-1 is divisible by 4 if m is divisible by 4
#
# Alternative 1. By choosing a=1, all conditions are easy to reach.
#
# The randomness quality is not important besides distributing potential contention,
# i.e. randomly trying thread i, then i+1, then i+n-1 (mod n) is good enough.
#
# Assuming 6 threads, co-primes are [1, 5], which means the following permutations
# assuming we start with victim 0:
# - [0, 1, 2, 3, 4, 5]
# - [0, 5, 4, 3, 2, 1]
# While we don't care much about randoness quality, it's a bit disappointing.
#
# Alternative 2. We can choose m to be the next power of 2, meaning all odd integers are co-primes,
# consequently:
# - we don't need a GCD to find the coprimes
# - we don't need to cache coprimes, removing a cache-miss potential
# - a != 1, so we now have a multiplicative factor, which makes output more "random looking".
# n and (m-1) <=> n mod m, if m is a power of 2
let maxExclusive = cast[uint32](maxExclusive)
let M = maxExclusive.nextPowerOfTwo_vartime()
let c = (randomSeed and ((M shr 1) - 1)) * 2 + 1 # c odd and c ∈ [0, M)
let a = (randomSeed and ((M shr 2) - 1)) * 4 + 1 # a-1 divisible by 2 (all prime factors of m) and by 4 if m divisible by 4
let mask = M-1 # for mod M
let start = randomSeed and mask
var x = start
while true:
if x < maxExclusive:
yield cast[int32](x)
x = (a*x + c) and mask # ax + c (mod M), with M power of 2
if x == start:
break
You are genius! I use congruential generators for many different use cases but this use case I never thought of. So simple yet so efficient. Congratulations!
A tricky issue in message-passing based work-requesting/work-stealing is how to efficiently select victims, especially with a large number of cores.
Instead we can use a Linear Congruential Generator: https://en.wikipedia.org/wiki/Linear_congruential_generator
The following recurrence generates all random numbers mod m without repetition:
Xₙ₊₁ = aXₙ+c (mod m)
if and only if (Hull-Dobell theorem):
The only state to send in steal request would be
a
andc
.Xₙ
is the victim ID andm
is either the number of threads or for ease of implementation the next power of 2.Nim code: https://github.com/mratsim/constantine/blob/1dfbb8bd4f2fefc4059f4aefb040f0b3ba2918cb/constantine/platforms/threadpool/threadpool.nim#L84-L128