You should send a pointer to the array, not the full array. Even if I allowed a larger data buffer, your code would update a copy and not the original buffer, so changes wouldn't be propagated.
Unfortunately there is no way to capture an openArray to send across threads, so we need to resort to pointers (and it's unsafe if you can return from a function without ensuring that the spawned task is finished).
In C OpenMP there is no need for that because arrays == pointers, which is not the case in Nim.
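For illustration, a minimal sketch of this pointer-capture pattern (the buffer and the loop body are placeholders, not your code):

import weave

var buffer: array[1820, int]
let pBuffer = buffer.addr          # pass a pointer, not the array itself

init(Weave)
parallelFor i in 0 .. 1819:
  captures: {pBuffer}
  pBuffer[][i] = i * i             # dereference with [] so the original buffer is updated
syncRoot(Weave)                    # wait for all tasks before reading the buffer
exit(Weave)
echo buffer[3]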
Sorry, I have to admit that I don't understand the logic of Weave, at all. Now I have changed the parallel loop to
var
  vi = 0
  values : array[1820, int]
  sols : array[1820, array[16, int]]
  val_vi : int
  sol_vi : array[16, int]

syncScope(): # EXTREMELY IMPORTANT: no sync after the loops otherwise
  parallelFor i in 0 .. 12:
    captures: {vi, val_vi, sol_vi}
    for j in i+1 .. 13:
      for k in j+1 .. 14:
        for l in k+1 .. 15:
          (val_vi, sol_vi) = Optimize(i, j, k, l)
          values[vi] = val_vi
          sols[vi] = sol_vi
          inc(vi)
and now I get Error: 'val_vi' cannot be assigned to. Why is that?
Furthermore, I have an AMD Threadripper with 16 cores and 32 hardware threads. When Weave is running, I see that it "only" uses 16 threads. (If I compile the Linux kernel, e.g., I see 32 threads running.)
Thanks for your patience. For me, Weave looks much more complicated than the experimental 'parallel clause' which just works
Your val_vi and sol_vi should be unique per thread, otherwise all threads will try to update those shared variables and you will get undefined results; this would be true in OpenMP as well.
So your code needs to be rewritten like so:
var
  values : array[1820, int]
  sols : array[1820, array[16, int]]
let pvalues = values.addr
let psols = sols.addr

syncScope(): # EXTREMELY IMPORTANT: no sync after the loops otherwise
  parallelFor vi in 0 .. 12: # your iteration index needs to be assigned by Weave; a `+= 1` shared across threads will lead to undefined behavior
    captures: {pvalues, psols}
    for j in vi+1 .. 13:
      for k in j+1 .. 14:
        for l in k+1 .. 15:
          var val_vi: int # local variables shouldn't be shared or you will have undefined behavior
          var sol_vi: array[16, int]
          (val_vi, sol_vi) = Optimize(vi, j, k, l)
          pvalues[][vi] = val_vi # dereference the captured pointers with [] to write to the original arrays
          psols[][vi] = sol_vi
For me, Weave looks much more complicated than the experimental 'parallel clause' which just works
Unfortunately it breaks down as soon as you need advanced things so it was a non-starter for me.
Furthermore, I have an AMD Threadripper with 16 cores and 32 hardware threads. When Weave is running, I see that it "only" uses 16 threads. (If I compile the Linux kernel, e.g., I see 32 threads running.)
This is strange; Weave has no logic to distinguish cores from threads, it just uses Nim's countProcessors. For my use-cases it's actually beneficial to match the physical core count rather than the logical core count.
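For reference, you can check the count Weave starts from with the standard library:

import std/cpuinfo

echo countProcessors() # the default worker count Weave picks up

If I remember correctly, the WEAVE_NUM_THREADS environment variable can also be used to override the worker count.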
Thanks Mamy! Unfortunately, I still don't understand some things. First, you write
parallelFor vi in 0 .. 12: # your iteration index needs to be assigned by Weave;
                           # a `+= 1` shared across threads will lead to undefined behavior
In my original code
parallelFor i in 0 .. 12:
  captures: {pvalues, psols}
  for j in i+1 .. 13:
    for k in j+1 .. 14:
      for l in k+1 .. 15:
        ....
        inc(vi)
vi ranges from 0 to 1819. See my comment at the beginning:
var C = 0
for i in 0 .. 12:
  for j in i+1 .. 13:
    for k in j+1 .. 14:
      for l in k+1 .. 15:
        inc(C)
echo C # gives 1820
So, in general, the index vi can only be computed; it's not a simple loop variable. I see the problem - I have to think about how the code could be rewritten. Furthermore, there is a severe load balancing problem, since vi is incremented 455 times for i == 0 but only once for i == 12. Probably it's not a good idea to use parallelFor here at all.
Furthermore, I assume that a var is missing in your code like
syncScope(): # EXTREMELY IMPORTANT: no sync after the loops otherwise
  parallelFor i in 0 .. 12:
    captures: {pvalues, psols}
    for j in i+1 .. 13:
      for k in j+1 .. 14:
        for l in k+1 .. 15:
          var # <<< H E R E
            val_vi: int # local variables shouldn't be shared or you will have undefined behavior
            sol_vi: array[16, int]
          (val_vi, sol_vi) = Optimize(i, j, k, l)
          values[vi] = val_vi
          sols[vi] = sol_vi
          inc(vi)
The load balancing isn't an issue for Weave; you can use nested parallel for loops or insert loadBalance(Weave) calls in the inner loop.
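For example, a sketch of where such a call could go in your triangular loop nest (assuming the pvalues, psols and Optimize definitions from the snippets above; the result handling is omitted):

parallelFor i in 0 .. 12:
  captures: {pvalues, psols}
  for j in i+1 .. 13:
    loadBalance(Weave)           # answer pending steal requests / split remaining work if other workers are idle
    for k in j+1 .. 14:
      for l in k+1 .. 15:
        discard Optimize(i, j, k, l)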
Weave was made to handle load imbalance and currently has the best load balancer of all multithreading frameworks I've tested against. The only one that may have a load balancer that comes close is Manticore (https://github.com/ManticoreProject/manticore, http://manticore.cs.uchicago.edu/papers/jfp-lts-submitted.pdf)
I've explained visually Weave strategy here on a raytracing example: https://forum.nim-lang.org/t/6367#39266
Raytracing is quite interesting because a ray can require no work at all (touches a white wall), medium work, or a lot of work (rebounds on mirrors or reflections on metal or glass). We assume 2 loops to schedule, one along the height and another along the width. (Doing nested parallel for loops in OpenMP is left as an exercise to the reader.)
The classic approach is a grainsize, which represents the minimum amount of work required by tasks.
The problem: the grainsize is impossible to assess for a multithreading framework; a grain size of 4 iterations would be way too small for an addition and way too big for some complex tasks.
The solution: expose that to the developer, but how is the developer supposed to know, on a scale of 1 to 1000, how much compute power their task requires? So they don't tune it.
Furthermore this forgets that a CPU isn't busy with only one process, and even if the initial split was perfect, load imbalance can be introduced by your code editor running some kind of autocompletion, for example.
Weave instead splits loops adaptively at runtime; you can help its scheduler with nested parallelFor calls or loadBalance(Weave) calls. This is what I do in the raytracer: https://github.com/mratsim/weave/blob/f41a5626880e55600f57eb1dbf5c29ae30e8cfb2/demos/raytracing/smallpt.nim#L271-L282
Beyond the adaptive loop splitting, Weave is also able to automatically balance how many tasks should be transferred between threads, so that small tasks are actually sent in bulk.
So if load balancing is an issue, it's a bug in Weave that I'm very interested in improving.
Yes, a var was missing; I've edited the code.
I see now which values vi gets. For that loop to be parallelizable there needs to be a function that does
for i in 0 .. 12:
  for j in i+1 .. 13:
    for k in j+1 .. 14:
      for l in k+1 .. 15:
        let vi = deriveOffset(i, j, k, l)
so that vi becomes stateless.
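For illustration, one way to write such a deriveOffset is as the lexicographic rank of the combination i < j < k < l among all 4-element subsets of {0 .. 15}; a sketch:

import std/math # for binom

proc deriveOffset(i, j, k, l: int, n = 16): int =
  ## Lexicographic rank of the combination i < j < k < l among all
  ## 4-element subsets of {0 .. n-1}; for n = 16 it ranges over 0 .. 1819,
  ## matching the order of the nested loops above.
  for ii in 0 ..< i:     result += binom(n - 1 - ii, 3)
  for jj in i + 1 ..< j: result += binom(n - 1 - jj, 2)
  for kk in j + 1 ..< k: result += binom(n - 1 - kk, 1)
  result += l - k - 1

With that, values[deriveOffset(i, j, k, l)] can be written from any thread without sharing a counter.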
Hi Mamy, meanwhile I have realized that my problem is a reduction problem: computing the optimal goal value and the corresponding optimal object of a discrete optimization problem (by a heuristic algorithm like Ant Colony Optimization). The above example is simpler; it just solves a tiny problem by enumeration.
If you're interested (comments are more than welcome), I have put the new version of the above problem on Nim's playground: https://play.nim-lang.org/#ix=2SmL
Next I'll try to use several nested parallelForStaged loops.
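As a first step, here is how a single-loop min-reduction with parallelForStaged might look (a sketch only: evaluate stands in for my Optimize, and the lock-based merge in the epilogue is just one way to combine the per-thread results):

import weave, std/locks

proc evaluate(i: int): int =
  ## Placeholder for the real objective function.
  (i * i) mod 997

proc findBest(n: int): int =
  var
    best = high(int)
    bestLock: Lock
  bestLock.initLock()
  let
    pBest = best.addr
    pLock = bestLock.addr

  parallelForStaged i in 0 .. n - 1:
    captures: {pBest, pLock}
    awaitable: bestLoop
    prologue:
      var localBest = high(int)   # per-thread partial minimum
    loop:
      let v = evaluate(i)
      if v < localBest:
        localBest = v
    epilogue:
      acquire(pLock[])            # merge the thread-local minimum into the shared best
      if localBest < pBest[]:
        pBest[] = localBest
      release(pLock[])

  discard sync(bestLoop)          # wait for this loop before reading best
  bestLock.deinitLock()
  result = best

init(Weave)
echo findBest(1820)
exit(Weave)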
I hope parallelForStaged stays (AFAIK, it's experimental just now).
Thanks again for your help, Helmut
One more question: (How) Does awaitable work in nested loops?
In the following example, sync(MyLoop) does not wait until all loops are finished. What am I missing?
parallelForStaged I in 0 .. 12:
  awaitable: MyLoop
  captures: {BestSolPtr}
  prologue:
    discard
  loop:
    parallelForStaged J in I+1 .. 13:
      captures: {BestSolPtr, I}
      prologue:
        discard
      loop:
        parallelForStaged K in J+1 .. 14:
          captures: {BestSolPtr, I, J}
          prologue:
            discard
          loop:
            parallelForStaged L in K+1 .. 15:
              captures: {BestSolPtr, I, J, K}
              prologue:
                discard
              loop:
                Optimize(I, J, K, L)
              epilogue:
                discard
          epilogue:
            discard
      epilogue:
        discard
  epilogue:
    discard

discard sync(MyLoop) # DOES NOT WAIT
I hope parallelForStaged stays (AFAIK, it's experimental just now)
It will stay, but the API may evolve if it's not ergonomic enough.
In the following example, sync(MyLoop) does not wait until all loops are finished. What am I missing?
The outer loop only awaits for this part to be done:
loop:
  parallelForStaged J in I+1 .. 13:
    captures: {BestSolPtr, I}
    prologue:
      discard
    loop:
      ... # other nested loops inside
    epilogue:
      discard
And that part only enqueues tasks in the runtime, which can be done really quickly. Once the tasks are created you exit the sync, as there is no requirement to await the "grand-children" tasks.
Hence, if you want to await the tasks, children tasks, grand-children tasks and the whole ancestry, you need to use a syncScope here:
syncScope:
  parallelForStaged I in 0 .. 12:
    captures: {BestSolPtr}
    prologue:
      discard
    loop:
      parallelForStaged J in I+1 .. 13:
        captures: {BestSolPtr, I}
        prologue:
          discard
        loop:
          parallelForStaged K in J+1 .. 14:
            captures: {BestSolPtr, I, J}
            prologue:
              discard
            loop:
              parallelForStaged L in K+1 .. 15:
                captures: {BestSolPtr, I, J, K}
                prologue:
                  discard
                loop:
                  Optimize(I, J, K, L)
                epilogue:
                  discard
            epilogue:
              discard
        epilogue:
          discard
    epilogue:
      discard
Many thanks, Helmut
The attached programme gives:
How to enlarge the parallel tasks data buffer? Where does the 144 come from?
I haven't found anything in the documentation. Thanks for a hint, Helmut.
Count_WV.nim.txt