Use Lnear Congruential Generator to generate work-stealing permutation

A tricky issue in message-passing based work-requesting/work-stealing is how to efficiently select victims, especially with a large number of cores.

The initial intuition is to use a bitset, but an uint64 can only address up to 64 cores.
- What if we have a 8-socket Xeon Platinum 9282 with 56 cores - 112 threads each\ hence a total of 896 threads?\ A 896-bit bitset would be huge, and we don't really want to have this be a compile-time parameter.
The current solution in Weave is unsatisfactory, we send to the victims an address to our own set of tried/untried victims. This works with shared-memory but not in a distributed computing setting.

Instead we can use a Linear Congruential Generator: https://en.wikipedia.org/wiki/Linear_congruential_generator

The following recurrence generates all random numbers mod m without repetition:

Xₙ₊₁ = aXₙ+c (mod m)

if and only if (Hull-Dobell theorem):

c and m are coprime
a-1 is divisible by all prime factors of m
a-1 is divisible by 4 if m is divisible by 4

The only state to send in steal request would be a and c. Xₙ is the victim ID and m is either the number of threads or for ease of implementation the next power of 2.

Nim code: https://github.com/mratsim/constantine/blob/1dfbb8bd4f2fefc4059f4aefb040f0b3ba2918cb/constantine/platforms/threadpool/threadpool.nim#L84-L128

iterator pseudoRandomPermutation(randomSeed: uint32, maxExclusive: int32): int32 =
  ## Create (low-quality) pseudo-random permutations for [0, max)
  # Design considerations and randomness constraint for work-stealing, see docs/random_permutations.md
  #
  # Linear Congruential Generator: https://en.wikipedia.org/wiki/Linear_congruential_generator
  #
  # Xₙ₊₁ = aXₙ+c (mod m) generates all random number mod m without repetition
  # if and only if (Hull-Dobell theorem):
  # 1. c and m are coprime
  # 2. a-1 is divisible by all prime factors of m
  # 3. a-1 is divisible by 4 if m is divisible by 4
  #
  # Alternative 1. By choosing a=1, all conditions are easy to reach.
  #
  # The randomness quality is not important besides distributing potential contention,
  # i.e. randomly trying thread i, then i+1, then i+n-1 (mod n) is good enough.
  #
  # Assuming 6 threads, co-primes are [1, 5], which means the following permutations
  # assuming we start with victim 0:
  # - [0, 1, 2, 3, 4, 5]
  # - [0, 5, 4, 3, 2, 1]
  # While we don't care much about randoness quality, it's a bit disappointing.
  #
  # Alternative 2. We can choose m to be the next power of 2, meaning all odd integers are co-primes,
  # consequently:
  # - we don't need a GCD to find the coprimes
  # - we don't need to cache coprimes, removing a cache-miss potential
  # - a != 1, so we now have a multiplicative factor, which makes output more "random looking".

  # n and (m-1) <=> n mod m, if m is a power of 2
  let maxExclusive = cast[uint32](maxExclusive)
  let M = maxExclusive.nextPowerOfTwo_vartime()
  let c = (randomSeed and ((M shr 1) - 1)) * 2 + 1 # c odd and c ∈ [0, M)
  let a = (randomSeed and ((M shr 2) - 1)) * 4 + 1 # a-1 divisible by 2 (all prime factors of m) and by 4 if m divisible by 4

  let mask = M-1                                   # for mod M
  let start = randomSeed and mask

  var x = start
  while true:
    if x < maxExclusive:
      yield cast[int32](x)
    x = (a*x + c) and mask                         # ax + c (mod M), with M power of 2
    if x == start:
      break

mratsim / weave

Use Lnear Congruential Generator to generate work-stealing permutation #193