mratsim / weave

A state-of-the-art multithreading runtime: message-passing based, fast, scalable, ultra-low overhead

parallelFor has arguments that do not fit in the parallel tasks data buffer #168

Closed. HJarausch closed this issue 3 years ago.

HJarausch commented 3 years ago

The attached programme gives

parallelFor has arguments that do not fit in the parallel tasks data buffer.
  Argument types: (int, array[0..1819, int], array[0..1819, array[0..15, int]])
  Current size: 247528
  Maximum size allowed: 144

How do I enlarge the parallel tasks data buffer?

Where does the 144 come from?

I haven't found anything in the documentation. Thanks for a hint, Helmut

Count_WV.nim.txt

mratsim commented 3 years ago

You should send a pointer to the array, not the full array. Even if I allowed a larger data buffer, your code would update a copy and not the original buffer, so changes wouldn't be propagated.

Unfortunately there is no way to capture an openarray to send across threads, so we need to resort to pointers (which is unsafe if you can return from a function without ensuring that the spawned task is finished).

In C OpenMP there is no need for that because arrays decay to pointers, which is not the case in Nim.
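
For illustration, here is a minimal sketch of the pointer-capture approach (the 1820-element array mirrors the attached program; the squaring is just a stand-in for real work):

import weave

var values: array[1820, int]
let pvalues = values.addr        # 8 bytes in the task data buffer instead of ~14 KB

init(Weave)
syncScope():
  parallelFor i in 0 ..< 1820:
    captures: {pvalues}
    pvalues[][i] = i * i         # writes through the pointer land in `values`
exit(Weave)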

HJarausch commented 3 years ago

Sorry, I have to admit that I don't understand the logic of Weave at all. Now I have changed the parallel loop to

var
  vi = 0
  values : array[1820,int]
  sols   : array[1820,array[16,int]]
  val_vi  : int
  sol_vi  : array[16,int]

syncScope():  # EXTREMELY IMPORTANT: otherwise there is no sync after the loops
  parallelFor i in 0 .. 12 :
    captures: {vi,val_vi,sol_vi}
    for j in i+1 .. 13 :
      for k in j+1 .. 14 :
        for l in k+1 .. 15 :
          (val_vi,sol_vi) = Optimize(i,j,k,l)
          values[vi]= val_vi
          sols[vi]= sol_vi
          inc(vi)

and now I get Error: 'val_vi' cannot be assigned to - why is that?

Furthermore, I have an AMD Threadripper with 16 cores and 32 hardware threads. When Weave is running, I see that it "only" uses 16 threads. (If I compile the Linux kernel, for example, I see 32 threads running.)

Thanks for your patience. For me, Weave looks much more complicated than the experimental 'parallel' clause, which just works.

mratsim commented 3 years ago

Your val_vi and sol_vi should be unique per thread; otherwise all threads will try to update those shared variables and you will get undefined results. This would be true in OpenMP as well.

So your code needs to be rewritten like so:

var
  values : array[1820,int]
  sols   : array[1820,array[16,int]]

let pvalues = values.addr
let psols = sols.addr

syncScope():  # EXTREMELY IMPORTANT: otherwise there is no sync after the loops
  parallelFor vi in 0 .. 12: # your iteration index needs to be assigned by Weave; a `+= 1` shared across threads will lead to undefined behavior
    captures: {pvalues, psols}
    for j in vi+1 .. 13 :
      for k in j+1 .. 14 :
        for l in k+1 .. 15 :
          var val_vi: int # local variables shouldn't be shared or you will have undefined behavior
          var sol_vi: array[16,int]
          (val_vi,sol_vi) = Optimize(vi,j,k,l)
          pvalues[][vi] = val_vi
          psols[][vi] = sol_vi

For me, Weave looks much more complicated than the experimental 'parallel' clause, which just works.

Unfortunately it breaks down as soon as you need advanced things, so it was a non-starter for me.

Furthermore, I have an AMD Threadripper with 16 cores and 32 hardware threads. When Weave is running, I see that it "only" uses 16 threads. (If I compile the Linux kernel, for example, I see 32 threads running.)

This is strange: Weave has no logic to distinguish cores from threads, it just uses Nim's countProcessors. For my use cases it's actually beneficial to match the physical core count rather than the logical core count.

https://github.com/mratsim/weave/blob/e5a3701d59ec17c1c71a22a299ab526777054b44/weave/runtime.nim#L35-L46
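
As a hypothetical quick check (not Weave code), this prints what Nim's countProcessors reports on the machine; on Linux it counts logical CPUs, so a 16-core/32-thread Threadripper should print 32:

import std/cpuinfo

echo countProcessors()   # number of logical processors the OS exposes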

HJarausch commented 3 years ago

Thanks Mamy! Unfortunately, I still don't understand some things. First, you write

parallelFor vi in 0 .. 12: # your iteration index needs to be assigned by Weave;
                           # a `+= 1` shared across threads will lead to undefined behavior

In my original code

parallelFor i in 0 .. 12 :
    captures: {pvalues, psols}
    for j in i+1 .. 13 :
      for k in j+1 .. 14 :
       for l in k+1 .. 15 :
         ....
         inc(vi)

vi ranges from 0 to 1819. See my comment at the beginning:

var C = 0
for i in 0 .. 12 :
  for j in i+1 .. 13 :
    for k in j+1 .. 14 :
      for l in k+1 .. 15 :
        inc(C)
echo C   # gives 1820

So, in general, the index vi can only be computed; it's not a simple loop variable. I see the problem - I have to think about how the code could be rewritten. Furthermore, there is a severe load-balancing problem, since vi is incremented 455 times for i == 0 but only once for i == 12. Probably it's not a good idea to use parallelFor here at all.

Furthermore, I assume that a var is missing in your code, like:

syncScope():  # EXTREMELY IMPORTANT: otherwise there is no sync after the loops
  parallelFor i in 0 .. 12 :
    captures: {pvalues, psols}
    for j in i+1 .. 13 :
      for k in j+1 .. 14 :
        for l in k+1 .. 15 :
          var   # <<<   H E R E
            val_vi: int # local variables shouldn't be shared or you will have undefined behavior.
            sol_vi: array[16,int]
          (val_vi,sol_vi) = Optimize(i,j,k,l)
          values[vi]= val_vi
          sols[vi]= sol_vi
          inc(vi)

mratsim commented 3 years ago

On load balancing

Load balancing isn't an issue for Weave: you can use nested parallel for loops or insert loadBalance(Weave) calls in the inner loop. Weave was made to handle load imbalance and currently has the best load balancer of all the multithreading frameworks I've tested against. The only one that may have a load balancer that comes close is Manticore (https://github.com/ManticoreProject/manticore, http://manticore.cs.uchicago.edu/papers/jfp-lts-submitted.pdf).
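
As a rough sketch (doWork is a placeholder I've added for the real per-combination computation), the triangular loop from this thread could keep only the outer loop parallel and give the scheduler split points in the inner loops:

import weave

proc doWork(i, j, k, l: int) = discard  # placeholder for Optimize + result storage

init(Weave)
syncScope():
  parallelFor i in 0 .. 12:
    for j in i+1 .. 13:
      for k in j+1 .. 14:
        loadBalance(Weave)   # give Weave a chance to redistribute remaining work here
        for l in k+1 .. 15:
          doWork(i, j, k, l)
exit(Weave)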

I've explained Weave's strategy visually on a raytracing example here: https://forum.nim-lang.org/t/6367#39266

Raytracing is quite interesting because a ray can require no work at all (it touches a white wall), medium work, or a lot of work (rebounds on mirrors or reflections on metal or glass). We assume two loops to schedule, one along the height and another along the width. (Doing nested parallel for loops in OpenMP is left as an exercise to the reader.)

  1. A. Static scheduling (OpenMP). Each region of the image is assigned to a thread; there is imbalance, with region 2 being very, very fast to process.
  2. B. Dynamic but eager scheduling (OpenMP dynamic or guided, and all other frameworks like TBB, Taskflow, Kokkos, HPX, StarPU, Chapel, ...). When the compiler encounters a parallel for, the regions are recursively split into small tasks using some magic number called the grain size, which represents the minimum amount of work required by a task. The problem: the grain size is impossible to assess for a multithreading framework; a grain size of 4 iterations would be way too small for an addition and way too big for some complex tasks. The solution: expose it to the developer, but how is the developer supposed to know, on a scale of 1 to 1000, how much compute power their task requires? So they don't tune it. Furthermore this forgets that a CPU isn't busy with just one process, so even if the initial split was perfect, load imbalance can be introduced by your code editor running some kind of autocompletion, for example.
  3. C. Dynamic, lazy and adaptive scheduling (Weave and Manticore). The loop is split if and only if a thread actually runs out of work, meaning the user has absolutely no worry about load imbalance. The splitting can only happen at parallelFor calls or loadBalance(Weave) calls.

This is what I do in the raytracer: https://github.com/mratsim/weave/blob/f41a5626880e55600f57eb1dbf5c29ae30e8cfb2/demos/raytracing/smallpt.nim#L271-L282

Beyond the adaptive loop splitting, Weave is also able to automatically balance how many tasks should be transferred between threads, so that small tasks are actually sent in bulk.

So if load balancing is an issue, it's a bug in Weave that I'm very interested in improving on.


On the rest

Yes, a var was missing; I've edited the code.

I see now what values vi takes. For that loop to be parallelizable there needs to be a function that does

for i in 0 .. 12 :
  for j in i+1 .. 13 :
    for k in j+1 .. 14 :
      for l in k+1 .. 15 :
        let vi = deriveOffset(i, j, k, l)

so that vi becomes stateless.
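
For completeness, one possible deriveOffset (my sketch, not from the thread): vi is the lexicographic rank of the combination (i, j, k, l) among the 1820 four-element subsets of {0..15}, which is exactly the order the nested loops visit them, so it can be computed from the indices alone:

import std/math   # binom

# Lexicographic rank of the 4-combination (i, j, k, l), 0 <= i < j < k < l <= 15,
# i.e. how many combinations the nested loops visit before reaching it.
# deriveOffset(0, 1, 2, 3) == 0 and deriveOffset(12, 13, 14, 15) == 1819.
func deriveOffset(i, j, k, l: int): int =
  for a in 0 ..< i:            # combinations whose first element is below i
    result += binom(15 - a, 3)
  for b in i+1 ..< j:          # first element i, second below j
    result += binom(15 - b, 2)
  for c in j+1 ..< k:          # first two fixed, third below k
    result += binom(15 - c, 1)
  result += l - k - 1          # last element strictly between k and l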

HJarausch commented 3 years ago

Hi Mamy, meanwhile I have recognized that my problem is a reduction problem: computing the optimal goal and the corresponding optimal object of a discrete optimization problem (with a heuristic algorithm like Ant Colony Optimization). The above example is simpler; it just solves a tiny problem by enumeration.

If you're interested (comments are more than welcome), I have put the new version of the above problem on Nim's playground: https://play.nim-lang.org/#ix=2SmL

Next I'll try to use several nested parallelForStaged loops.

I hope parallelForStaged stays (AFAIK it's experimental right now).

Thanks again for your help, Helmut

HJarausch commented 3 years ago

One more question: (How) Does awaitable work in nested loops?

In the following example, sync(MyLoop) does not wait until all loops are finished. What am I missing?

parallelForStaged I in 0 .. 12 :
  awaitable: MyLoop
  captures: {BestSolPtr}
  prologue:
    discard
  loop:
    parallelForStaged J in I+1 .. 13 :
      captures: {BestSolPtr,I}
      prologue:
        discard
      loop:
        parallelForStaged K in J+1 .. 14 :
          captures: {BestSolPtr,I,J}
          prologue:
           discard
          loop:
            parallelForStaged L in K+1 .. 15 :
              captures: {BestSolPtr,I,J,K}
              prologue:
                discard
              loop:
                Optimize(I,J,K,L)
              epilogue:
                discard
          epilogue:  
            discard  
      epilogue:  
        discard  
  epilogue:
    discard

discard sync(MyLoop)  # DOES NOT WAIT

mratsim commented 3 years ago

I hope parallelForStaged stays (AFAIK it's experimental right now).

It will stay, but the API may evolve if it's not ergonomic enough.

In the following example, sync(MyLoop) does not wait until all loops are finished. What am I missing?

The outer loop only waits for this part to be done:

  loop:
    parallelForStaged J in I+1 .. 13 :
      captures: {BestSolPtr,I}
      prologue:
        discard
      loop:
        ... # other nested loops inside
      epilogue:  
        discard  

And that part only enqueues tasks in the runtime, which can be done really quickly. Once the tasks are created you exit the sync, as there is no requirement to wait for the "grand-children" tasks.

Hence, if you want to wait for the tasks, their children, grand-children and the whole descendant tree, you need to use a syncScope here:

syncScope:
  parallelForStaged I in 0 .. 12 :
    captures: {BestSolPtr}
    prologue:
      discard
    loop:
      parallelForStaged J in I+1 .. 13 :
        captures: {BestSolPtr,I}
        prologue:
          discard
        loop:
          parallelForStaged K in J+1 .. 14 :
            captures: {BestSolPtr,I,J}
            prologue:
              discard
            loop:
              parallelForStaged L in K+1 .. 15 :
                captures: {BestSolPtr,I,J,K}
                prologue:
                  discard
                loop:
                  Optimize(I,J,K,L)
                epilogue:
                  discard
            epilogue:  
              discard  
        epilogue:  
          discard  
    epilogue:
      discard

HJarausch commented 3 years ago

Many thanks, Helmut