Threadpool deadlocks on common multithreading examples

mratsim commented 5 years ago

Even though it's incredibly inefficient, the "Hello World" of task parallelism is computing Fibonacci in recursive way, spawning about 2^n tasks.

Unfortunately, Nim threadpool reliably deadlocks on relatively small inputs:

fib 6: on my dual-core laptop
fib 16: on my 18-core workstation

Reproduction

import
  # STD lib
  os, strutils, threadpool, strformat

# Compile with "nim c --threads:on -d:release -d:danger --outdir:build 

proc parfib(n: uint64): uint64 =
  if n < 2:
    return n

  let x = spawn parfib(n-1)
  let y = parfib(n-2)

  return ^x + y

proc main() =
  if paramCount() != 1:
    echo "Usage: fib <n-th fibonacci number requested>"
    quit 0

  let n = paramStr(1).parseUInt.uint64

  let f = parfib(n)

  echo "Result: ", f

main()

pb-cdunn commented 5 years ago

Same for me.

% nim -v
Nim Compiler Version 0.20.99 [Linux: amd64]
git hash: 170cb4f4cbe734d6fa30415ddfea725cbe35f84e

fib 15 is fine.
fib 16 hangs

Oddly, with the 0.20.2 version (88a0edb), this stalls as early as fib 15, but again reliably. With nim-0.17.0 it hangs on fib 16.

mratsim commented 4 years ago

I did some more experiments today, the threadpool also deadlocks on common task parallelism benchmarks:

depth-first search
nqueens
heat diffusion equation

Source: https://github.com/mratsim/weave/commit/be776c417c5262479db9577cd140ffa95023d3c4

spacepluk commented 4 years ago

is this a deadlock? it looks more like recursion is exhausting the threads in the pool

mratsim commented 4 years ago

When I checked with ps aux for the thread state indicator, iirc it was in uninterruptible sleep state

spacepluk commented 4 years ago

I'm still trying to wrap my head around this fib implementation, but I think that by recursively spawning threads this way they are effectively blocking one another until the threadpool runs out of threads. Then the last call to spawn blocks waiting for the threadpool to give it a new thread which is impossible at that point.

mratsim commented 4 years ago

You may be right

I rewrote the example with preferSpawn to ensure that the thread pool has threads ready, it now deadlocks at fib 31:

import
  # STD lib
  os, strutils, threadpool, strformat

# Compile with "nim c --threads:on -d:danger --outdir:build

proc parfib(n: uint64): uint64 =
  if n < 2:
    return n

  if preferSpawn():
    let x = spawn parfib(n-1)
    let y = parfib(n-2)
    return ^x + y
  else:
    let x = parfib(n-1)
    let y = parfib(n-2)
    return x + y

proc main() =
  if paramCount() != 1:
    echo "Usage: fib <n-th fibonacci number requested>"
    quit 0

  let n = paramStr(1).parseUInt.uint64

  let f = parfib(n)

  echo "Result: ", f

main()

Note: the spawnX template doesn't work when a spawned function returns something.

One of the issues I can see is that the preferSpawn -> spawn sequence is not atomic. so it may just trigger pool exhaustion as well if 2 threads compete for the last thread in the pool.

spacepluk commented 4 years ago

I'm still pretty new to nim so don't take my word on it. But since it doesn't freeze on every execution, I would say that this needs a mutex mutex somewhere and the pool is being exhausted between the calls to preferSpawn and spawn.

I still can't think of a situation where it would make sense to spawn threads this way because the end result is going to be sequential anyway.

mratsim commented 4 years ago

Fibonacci is a toy example, there are much faster ways to implement it, but it's useful to understand the overhead of a multithreading runtime since the only computation done is so small.

The main benefit of a low overhead multithreading runtime is that the developer doesn't have to worry anymore about the task size before using spawn and those are huge worries in:

Recursive algorithms, some tasks at the leaves might be too small, sometimes you can ensure to not divide-and-conquer too deep, but sometimes by the time you realize that the task is small it's too late and backtracking would be costly as well.
In numerical computing, with the standard OpenMP, the best task size varies wildly between CPUs and the operations: https://github.com/zy97140/omp-benchmark-for-pytorch
- don't parallelize a copy/add/mul on less than 80k elements for a Xeon platinum but 8k on a weaker CPU
- beyond 1k element it's always worth it to parallelize exponential.

This is mainly due to 2 things:

tasks should be big enough to amortize the coordination overhead.
the load balancing, the cost of creating tasks.

A runtime that doesn't address those makes it difficult to write a generic higher-order function like map on top as you need to hardcode some heuristics. While Nim default threadpool doesn't implement load balancing (it's complex), I think it shouldn't choke common parallel benchmarks.

At least it's not alone:

GCC OpenMP is worse, it deadlocks with some CPUs stuck at 100%
Julia is in the 8s for fib(40) but it terminates at least https://github.com/JuliaLang/julia/issues/33762
HPX (a large contributor to C++ async) is similar to Nim and GCC https://github.com/STEllAR-GROUP/hpx/issues/4227
Can't say about Rust, V, D, Crystal, they don't seem to focus on task parallelism.

Fibonacci is part of the following suites:

PARSEC (Princeton Application Repository for Shared-Memory Computers) (https://github.com/bamos/parsec-benchmark/tree/master/pkgs/libs/tbblib/src/examples/test_all/fibonacci)
BOTS (Barcelona OpenMP Task Suite) from Barcelona Super Computing Center (https://github.com/bsc-pm/bots)

And was used to evaluate multithreading runtimes by the US-led scientific programs: https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf

spacepluk commented 4 years ago

Thanks for all the info. I still don't think this is a common case at all. I tried and failed to find an example in your links where they spawn threads recursively.

IMHO what you're seeing here is the parallel equivalent of a stack overflow, and that would generally be considered a bug in your code. Of course, some languages have TCO but that's a design choice and not having it isn't fundamentally wrong.

mratsim commented 4 years ago

In universities people are introduced to parallel computing with recursive algorithms: http://ipcc.cs.uoregon.edu/lectures/lecture-9-fork-join.pdf

In scientific computing, recursive divide-and-conquer is the standard, and also often the fastest for lots of matrix operations, see this course for example: https://www.cs.fsu.edu/~engelen/courses/HPC/Algorithms2.pdf

So I think it's important to support those workloads. As a nice benefit, I do think Nim is a much nicer language to approach than C++ or Fortran, especially on professors slides ;).

Besides the fact that it's ubiquitous, the main reason the fork-join / divide-and-conquer model is widespread is that it's efficient. The Cilk paper (http://supertech.csail.mit.edu/papers/PPoPP95.pdf) has been foundational and has formally proven bounds on the speedup of fork-join parallelism backed by work-stealing on N processors (a idle thread steals task in a busy thread task queue). And the Cilk approach is mathematically optimal (there is at least another optimal alternative I am aware that Julia is using, but it's still optimized for recursive spawns).

Regarding concrete examples of divide-and-conquer

Be sure to read the HPC course from the previous paragraph (https://www.cs.fsu.edu/~engelen/courses/HPC/Algorithms2.pdf)

parallel sum
bucket sort
Barnes Hut (for n-bodies / gravity simulation)
Matrix Multiplication
Convex Hull (creating bounding areas)
Parallel recurrences (fibonacci is one)
Matrix inversion
Matrix decomposition

N-queens spawns task recursively, and billions of tasks at that: https://github.com/mratsim/weave/blob/c26cfb2c7efbdb28ad7459a61bd567419d9e22c5/benchmarks/nqueens/weave_nqueens.nim#L111-L149

proc nqueens_par(n, j: int32, a: CharArray): int32 =

  if n == j:
    # Good solution, count it
    return 1

  var localCounts = alloca(Flowvar[int32], n)
  zeroMem(localCounts, n * sizeof(Flowvar[int32]))

  # Try each position for queen `j`
  for i in 0 ..< n:
    var b = alloca(char, j+1)
    copyMem(b, a, j * sizeof(char))
    b[j] = cast[char](i)
    if isValid(j+1, b):
      localCounts[i] = spawn nqueens_par(n, j+1, b)

  for i in 0 ..< n:
    if localCounts[i].isSpawned():
      result += sync(localCounts[i])

const solutions = [
  1,
  0,
  0,
  2,
  10, # 5x5
  4,
  10,
  92, # 8x8
  352,
  724, # 10x10
  2680,
  14200,
  73712,
  365596,
  2279184, # 15x15
  14772512
]

Cache-oblivious matrix multiplication is also recursive: https://github.com/mratsim/weave/blob/c26cfb2c7efbdb28ad7459a61bd567419d9e22c5/benchmarks/matmul_cache_oblivious/weave_matmul_co.nim#L101-L143

proc matmul[T](A, B, C: Matrix[T], m, n, p: int, add: bool): bool =
  # The original bench passes around a ``ld`` parameter (leading dimension?),
  # we store it in the matrices
  # We return a dummy bool to allow waiting on the matmul

  # Threshold
  if (m + n + p) <= 64:
    if add:
      for i in 0 ..< m:
        for k in 0 ..< p:
          var c = 0.T
          for j in 0 ..< n:
            c += A[i, j] * B[j, k]
          C[i, k] += c
    else:
      for i in 0 ..< m:
        for k in 0 ..< p:
          var c = 0.T
          for j in 0 ..< n:
            c += A[i, j] * B[j, k]
          C[i, k] = c

    return

  var h0, h1: FlowVar[bool]
  ## Each half of the computation

  # matrix is larger than threshold
  if m >= n and n >= p:
    let m1 = m shr 1 # divide by 2
    h0 = spawn matmul(A, B, C, m1, n, p, add)
    h1 = spawn matmul(A.stride(m1, 0), B, C.stride(m1, 0), m - m1, n, p, add)
  elif n >= m and n >= p:
    let n1 = n shr 1 # divide by 2
    h0 = spawn matmul(A, B, C, m, n1, p, add)
    h1 = spawn matmul(A.stride(0, n1), B.stride(n1, 0), C, m, n - n1, p, add = true)
  else:
    let p1 = p shr 1
    h0 = spawn matmul(A, B, C, m, n, p1, add)
    h1 = spawn matmul(A, B.stride(0, p1), C.stride(0, p1), m, n, p - p1, add)

  discard sync(h0)
  discard sync(h1)

Any tree algorithm is easier to do when you can spawn recursively, the unbalanced tree search used in the Sandia US-lab link is the gold standard to compare runtimes and is recursive, implementation in C: https://github.com/bsc-pm/bots/blob/2607a695128727a3f6c98754367705443663136d/omp-tasks/uts/uts.c#L169-L220.

Similarly, in data science, and machine learning, there are many trees algorithms that can benefit from recursive spawning and would be simplified if we didn't had to worry about grain size:

random forests and gradient boosted trees: https://kirnhans.github.io/15418-project/
Monte Carlo Tree Search, which powers all the Go AI like AlphaGo for example and has been tough to parallelize efficiently and still active area of research
Beam Search for natural language processing
Fast Fourier Transform: https://github.com/bsc-pm/bots/blob/master/omp-tasks/fft/fft.c#L218

spacepluk commented 4 years ago

Cool stuff, thanks for all the links. I'm familiar with fork-join and divide and conquer. I think we just have different expectations.

The point I'm failing to articulate is that in the fib example above the branch that spawns will always exhaust the pool when the depth reaches the thread count. And for me that's perfectly normal and acceptable when working with a simple thread pool. You can totally implement mechanisms that deal with it under the hood and allow you to write pretty functional code (which I love). But there's a cost to that as well that shouldn't be ignored, and again that's not what I would expect of a thing called thread pool :)

pb-cdunn commented 4 years ago

I see your point. It would be nice to have recursive thread-spawning, but it's also ok for the standard threadpool implementation to expect spawns from the main thread only.

However, spawnX would be a nice solution if it worked with functions that return values.

Not long ago, I had much more severe problems with threadpool. It would deadlock even from strictly main-thread usage, after I simply reduced the threadpool size. I need to be able to control the number of threads at runtime. (I reduce the size before using the threadpool, but the Nim's standard library threadpool initializes at program start-up.) That bug seems to be fixed, but I would rather have an instance of a threadpool anyway. It's just simpler. So I have switched to an alternative threadpool implementation:

mratsim commented 4 years ago

AFAIK @Araq asked me often on IRC if we should just switched to @yglukhov threadpool in the standard lib, and I think we should. Hopefully now that ARC is researched and implemented he has time to write a Threading API RFC and leave the implementation to us.

Araq commented 4 years ago

@mratsim Excuse me, but aren't you more qualified to write such an RFC?

mratsim commented 4 years ago

Well you kept saying that you will write one on IRC, but yes I can write one.

pb-cdunn commented 3 years ago

Where are we on this?

I was looking at weave, which is awesome, but it uses globals (right?) and I really want an instance of a threadpool. That's the only reason I'm still using yglukhov's.

mratsim commented 3 years ago

Yes it uses globals, I do plan to migrate away from them (https://github.com/mratsim/weave/issues/9).

There are 4 "globals" used: https://github.com/mratsim/weave/blob/e5a3701d/weave/contexts.nim#L28-L34 which are

true globals: https://github.com/mratsim/weave/blob/e5a3701d/weave/datatypes/context_global.nim#L23-L74
thread-local "globals": https://github.com/mratsim/weave/blob/e5a3701d/weave/datatypes/context_thread_local.nim#L19-L110

All the operations can be changed to accept a global and thread-local context and then Weave can either use a global like today or work with pool instances with pool.spawn(...). The only limitations is that memory pools have to use a global https://github.com/mratsim/weave/blob/e5a3701d/weave/memory/memory_pools.nim.

Also for the standard library @Araq and I agree that @yglukhov is much more suited to it. Weave is too complex to dump its maintenance to standard library maintainers.

I do plan to have a much simpler (maybe 2000 lines at most) work-stealing threadpool (or threadpools) anyway because I need one for parallelizing some cryptographic workloads and we certainly don't want to have a potential bug/attack surface as large as Weave for crypto (memory management, globals, state machines, heap allocation, ...).

My main issues with @yglukhov threadpools are that they also deadlock on common (dumb) examples (Fibonacci) and they use Nim channels and Nim channels have RTTI magic that seems super complex to reproduce, they are also likely to have low throughput due to using a single lock for the whole channel instead of 2 locks for reader and writers (which is actually not that much slower than the lock-free channel/queues I use in Weave for implementing FlowVars).

pb-cdunn commented 3 years ago

That sounds like a great idea for weave: Make instance versions available (except for memory pools).

For stdlib, yeah, I like yglukhov/threadpools. It too could have a global version and an instance version for the main functions. It also needs a bit of updating for the latest Nim version.

I've never had a deadlock from yglukhov/threadpools, and I had deadlocks all the time from the stdlib threadpool. Could you point to a testcase?

mratsim commented 3 years ago

Unfortunately my 18 cores machine is out-of-order at the moment but it was just a plain Fibonacci. That said, GCC OpenMP (but not Clang OpenMP or Intel OpenMP and Intel TBB) also choke on this :/.

pb-cdunn commented 3 years ago

I have ~48~ 72 cores at Pacific Biosciences. I'd be happy to run your test, and maybe debug the library.

mratsim commented 2 years ago

See Taskpools https://github.com/status-im/nim-taskpools

https://forum.nim-lang.org/t/8791

pb-cdunn commented 2 years ago

Very nice. This belongs in the std lib.

@brentp , if you can upgrade to Nim 1.6, this should solve all your problems.

nim-lang / Nim

Threadpool deadlocks on common multithreading examples #11922

Regarding concrete examples of divide-and-conquer