mratsim / weave

A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead

Multithreaded random number generation #147

Closed mratsim closed 4 years ago

mratsim commented 4 years ago

Several multithreaded algorithms require randomness, for example all the Monte-Carlo methods used in:

However, due to dynamic load balancing and task migration, handling randomness is currently very complex.

One of the difficulties is that the RNG should be per-thread. This excludes users simply relying on {.threadvar.}: if there is a parallelism opportunity in a scope further away, they either need to copy the RNG into the task, leading to the same stream of random numbers in every task, or take a reference to the RNG, leading to multiple threads contending for the same RNG state:

import weave, random

var rng {.threadvar.}: Rand
rng = initRand(1234)
let adr_rng = rng.addr     # shared pointer: every thread contends on the same state
init(Weave)
parallelFor i in 0 ..< 10:
  captures: {rng, adr_rng} # copying `rng` gives every task the same stream
  discard
exit(Weave)

One thing that is not a difficulty: multithreading is inherently non-deterministic due to OS scheduling, and in particular floating-point addition is non-associative, which means parallel sum reductions are not deterministic anyway.
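For illustration, floating-point addition is not associative, so the grouping chosen by the scheduler changes the result of a sum:

    let a = 1e16
    let b = -1e16
    let c = 1.0
    echo (a + b) + c   # 1.0
    echo a + (b + c)   # 0.0, because b + c rounds back to -1e16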

That said, we could give users access to the thread-local RNG used by Weave via a special API similar to Nim's random module. It wouldn't be seedable: even if we kept a separate RNG for users and one for Weave, the RNG would still be per-thread and task migration is non-deterministic, so seeding wouldn't change anything.
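A minimal sketch of what such a per-thread API could look like (names and seeding are hypothetical, this is not Weave's API):

    import random

    var
      weaveRng {.threadvar.}: Rand   # one private RNG state per worker thread
      rngSeeded {.threadvar.}: bool

    proc threadRand(max: int): int =
      ## Draw from the calling thread's private RNG: no sharing, no contention.
      if not rngSeeded:
        # seeded from the thread ID: which thread runs which task is
        # non-deterministic anyway, so a user seed would not help reproducibility
        weaveRng = initRand(getThreadId() + 1)
        rngSeeded = true
      weaveRng.rand(max)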

Alternatively, we can look at the pedigree-based parallel RNG (DotMix) from Leiserson 2012: http://supertech.csail.mit.edu/papers/dprng.pdf
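Roughly, the paper's scheme compresses a task's pedigree (its spawn/iteration ranks) with a dot product against per-level random weights, then bit-mixes the result. A simplified sketch (the paper reduces modulo the prime 2^64 - 59 and uses RC6-style mixing rounds; both are simplified away here):

    proc mix64(x: uint64): uint64 =
      # generic 64-bit finalizer standing in for the paper's mixing rounds
      var z = x
      z = (z xor (z shr 33)) * 0xFF51AFD7ED558CCD'u64
      z = (z xor (z shr 33)) * 0xC4CEB9FE1A85EC53'u64
      z xor (z shr 33)

    proc dotMix(pedigree, gamma: openArray[uint64]): uint64 =
      var acc = 0'u64
      for lvl, rank in pedigree:
        # compression: the same pedigree always yields the same value,
        # no matter which thread ends up executing the task
        acc = acc + (rank + 1) * gamma[lvl]
      mix64(acc)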

Papers that cite this paper might also provide recent insights.

dumblob commented 4 years ago

There are known solutions to this problem (e.g. http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/ ), but the programmer has to be aware of them (at best the compiler should complain about a slow/blocking/otherwise inadequate PRNG), and that indeed does not seem transparent currently.
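For reference, the core of the linked approach is just an integer hash of the work-item index, giving each iteration a stateless, independent seed (a sketch, not Weave code):

    proc wangHash(seed: uint32): uint32 =
      ## Wang hash, used in the article to seed a cheap per-item RNG,
      ## e.g. wangHash(uint32 i) for loop iteration i.
      result = (seed xor 61'u32) xor (seed shr 16)
      result = result * 9
      result = result xor (result shr 4)
      result = result * 0x27D4EB2D'u32
      result = result xor (result shr 15)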

peteroupc commented 4 years ago

I am aware of splittable PRNGs, such as the one used by JAX. A JAX design document, https://github.com/google/jax/blob/master/design_notes/prng.md , explains in detail why JAX uses a splittable PRNG, and one of the reasons is parallelizability.
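To illustrate the idea with a toy sketch in Nim (JAX's actual construction is counter-based Threefry; this only shows the usage pattern): a key is split into independent child keys instead of mutating one shared stream, so the numbers a branch sees do not depend on scheduling.

    proc mix(x: uint64): uint64 =
      # toy mixer used only to derive child keys
      var z = x + 0x9E3779B97F4A7C15'u64
      z = (z xor (z shr 30)) * 0xBF58476D1CE4E5B9'u64
      z = (z xor (z shr 27)) * 0x94D049BB133111EB'u64
      z xor (z shr 31)

    proc split(key: uint64): tuple[left, right: uint64] =
      # two children derived purely from the parent key, no mutable shared state
      (mix(key), mix(not key))

    let root = 1234'u64
    let (taskKey, childKey) = split(root)   # deterministic regardless of scheduling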

mratsim commented 4 years ago

@dumblob, I've looked at your link: it solves the RNG problem but it's not deterministic.

Determinism and result reproducibility are a huge problem, in particular in machine learning, with thousands of data scientists banging their heads on non-reproducible results. Case in point: in my own project the first thing I do is set all random seeds: https://github.com/mratsim/Amazon-Forest-Computer-Vision/blob/f1fefbb6/main_pytorch.py#L91-L95

    # Setting random seeds for reproducibility. (Caveat, some CuDNN algorithms are non-deterministic)
    torch.manual_seed(1337)
    torch.cuda.manual_seed(1337)
    np.random.seed(1337)
    random.seed(1337)

Obviously that is not enough to guarantee reproducibility because of floating-point rounding, but it's a very important problem:

@peteroupc, it seems like the splittable RNG actually starts from the same idea as @dumblob's link: augmenting an RNG with a hash function. It also cites the Leiserson paper and highlights some weaknesses in the discussion section. I'll have to study that a bit more.

The main challenge in my runtime is that parallel for loops are split lazily, "just-in-time", even in the middle of a task, based on the current system load, so a parallel for loop might become 1 task, 2 tasks or 1000 tasks, unlike TBB or OpenMP implementations, which know the split before executing the loop.

If the split overhead is small enough I can split at each loop iteration, but that would be millions of splits for 2 loops across 1000x1000 matrices, so it needs to be really cheap, and really small in state as well, to not flush the caches. In Leiserson's scheme the pedigree is based on the loop iteration, so it would work whether the task is split or not; I expect the overhead is small as it seems to have been designed with that use case in mind.

(image: overhead figures from the paper)

However, they claim that their DotMix RNG is high quality, but from their graphs it's similar to a Mersenne Twister?

(image: RNG quality comparison from the paper)

I saw your page on RNG by the way, very nice: https://peteroupc.github.io/random.html

Other references (for myself):

mratsim commented 4 years ago

Closing, this can be handled as a library:

The following PR can be used as an example: https://github.com/mratsim/trace-of-radiance/pull/1/files#diff-58d680dde954134c145b9161eb3ca0b7R21-R29

With parallelism: https://github.com/mratsim/trace-of-radiance/pull/3/files

Note on the seeding function.

Nim's builtin seeding is not adequate: it initializes the high 64 bits of the xoroshiro RNG state with lots of zeros for small seeds. https://github.com/nim-lang/Nim/blob/acae3b02c7d3b7d916430223c5a2948e68d83ea4/lib/pure/random.nim#L573-L575
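A common alternative (a sketch of the usual recommendation from the xoroshiro authors; the linked PR may differ) is to expand the user seed with SplitMix64 so both 64-bit state words are well mixed even for small seeds:

    proc splitMix64(x: var uint64): uint64 =
      # SplitMix64 step: each call advances the state and returns a mixed word
      x += 0x9E3779B97F4A7C15'u64
      var z = x
      z = (z xor (z shr 30)) * 0xBF58476D1CE4E5B9'u64
      z = (z xor (z shr 27)) * 0x94D049BB133111EB'u64
      z xor (z shr 31)

    proc seedXoroshiro128(seed: uint64): array[2, uint64] =
      ## Derive both words of the xoroshiro128 state from a single seed.
      var s = seed
      result[0] = splitMix64(s)
      result[1] = splitMix64(s)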

Perf notes

The combined cost of:

is about 3% slower than the Nim stdlib, which is acceptable and comparable to Leiserson 2012.

It also does not require more space in the task object, where space is at a premium: the task carries 84 bytes of metadata and 144 bytes of task data for a total of 228 bytes, and there is a hard limit of 256 bytes due to the memory pool block size. With GPU tasks, NUMA-aware scheduling and distributed computing, the last 28 bytes will probably be used (assuming we use a unified task abstraction for all).