Closed: jhaye closed this issue 2 years ago
This is also a problem with the newly ported GuidedSearch benchmark. Again, things work fine without the parallel-runtime feature.
Here is the execution trace:
[2021-12-22T13:17:18Z INFO ] Assembling manager
[2021-12-22T13:17:18Z INFO ] Assembling runner
[2021-12-22T13:17:18Z INFO ] Assembling workers
[2021-12-22T13:17:18Z INFO ] Registering workers
[2021-12-22T13:17:18Z INFO ] Registering runner
[2021-12-22T13:17:18Z INFO ] Registering manager
[2021-12-22T13:17:18Z INFO ] Triggering startup...
[2021-12-22T13:17:18Z TRACE] - Level 1
[2021-12-22T13:17:18Z TRACE] - Executing /0
Benchmark: GuidedSearchBenchmark
[2021-12-22T13:17:18Z TRACE] - Executing /manager/0
Arguments:
numIterations = 12
threshold = 1024
[2021-12-22T13:17:18Z TRACE] - Executing /runner/0
gridSize = 30
priorities = 30
numWorkers = 20
System information:
O/S Name = Linux
[2021-12-22T13:17:18Z TRACE] - Level 4
[2021-12-22T13:17:18Z TRACE] - Executing /workers[0]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[12]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[6]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[1]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[7]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[11]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[10]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[4]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[16]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[18]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[5]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[14]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[19]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[3]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[13]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[15]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[8]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[9]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[17]/0
[2021-12-22T13:17:18Z TRACE] - Executing /workers[2]/0
[2021-12-22T13:17:18Z TRACE] Will wait for asynchronous event without timeout
[2021-12-22T13:17:18Z INFO ] Event queue is empty forever, shutting down.
[2021-12-22T13:17:18Z INFO ] Scheduler is shutting down, at (T0 + 1823966 ns = 1 ms, 0)
[2021-12-22T13:17:18Z INFO ] Scheduler has been shut down
I've investigated this a bit. The actions are genuinely never scheduled, as opposed to being scheduled and never executed.
After a tag is processed, all future events it caused are added to the event queue; since nothing ends up scheduled, the list of future events must be empty. We know these events aren't pushed, because the push_event macro would write a trace log.
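To make the two-phase behavior described above concrete, here is a toy model of it. This is a sketch under assumptions, not the actual reactor-rs code: `Scheduler`, `process_tag`, and the `u64` tags are illustrative stand-ins. Reactions at a tag only record the future events they cause; after the tag completes, those events are pushed onto the global queue. If the collected list is empty at that point, as in this bug, nothing is scheduled and the runtime shuts down.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Toy scheduler: a min-heap of future event tags (hypothetical model).
struct Scheduler {
    event_queue: BinaryHeap<Reverse<u64>>,
}

impl Scheduler {
    // Process one tag: run all reactions, then push the events they caused.
    fn process_tag(&mut self, now: u64, reactions: &[fn(u64) -> Vec<u64>]) {
        // Phase 1: run reactions; collect the future events they schedule.
        let mut future_events = Vec::new();
        for r in reactions {
            future_events.extend(r(now));
        }
        // Phase 2: push collected events onto the global queue. In the real
        // runtime the push_event macro would emit a trace line here, which is
        // how we know this list is empty in the buggy run.
        for tag in future_events {
            self.event_queue.push(Reverse(tag));
        }
    }
}
```

If a parallel code path loses the collected events before phase 2, the queue stays empty and the scheduler reports "Event queue is empty forever, shutting down", exactly as in the trace above.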
Since the bug only occurs with the parallel-runtime feature enabled, it seems like the culprit is somewhere in parallel_rt_impl::process_batch(…).
I wasn't able to dig further yet.
Thanks for taking a look! Sounds like you've got a good lead. From a quick glance, I would suspect that something in the unsafe code block isn't doing what it's supposed to do... but I don't understand it well enough either.
From what I can tell, the unsafe block is fine. Right now I'm investigating the merge logic that runs after parallel execution of the iterator. The code is very dense and sparsely documented, so I'll work on understanding and documenting it for now.
Hint: at the following line:
The insides are replaced, even though they may contain events from earlier batches processed at the current logical time, from what I can tell.
I think this should logically be `ctx.insides = RContextForwardableStuff::merge(ctx.insides, /* reduce expression result */);`, but you'd need to change the signature of `merge`, or use a new function that merges in place through a mutable reference, since you can't move `ctx.insides` out of the `ctx` reference.
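The suggested fix can be sketched with hypothetical stand-in types. `Insides`, `merge_from`, and `absorb_batch` are illustrative names, not the real reactor-rs API; the point is the borrow-checker issue with a consuming `merge` and the in-place alternative:

```rust
// Hypothetical stand-in for the forwardable context state: future events
// accumulated across batches at the current logical time.
#[derive(Default)]
struct Insides {
    future_events: Vec<u64>,
}

impl Insides {
    // Consuming merge, like the existing signature: fine when you own both
    // values, but it can't be called on a field behind `&mut ctx` without
    // first moving the field out.
    fn merge(mut self, other: Insides) -> Insides {
        self.future_events.extend(other.future_events);
        self
    }

    // In-place merge: takes `&mut self`, so it works directly on
    // `ctx.insides` without moving it out of the borrowed context.
    fn merge_from(&mut self, other: Insides) {
        self.future_events.extend(other.future_events);
    }
}

struct Ctx {
    insides: Insides,
}

// After each parallel batch, fold the batch result into the accumulated
// state. Overwriting with `ctx.insides = batch_result;` would drop events
// from earlier batches at the same tag, which is the bug described above.
fn absorb_batch(ctx: &mut Ctx, batch_result: Insides) {
    ctx.insides.merge_from(batch_result);
    // Alternative that keeps the consuming `merge` signature, at the cost
    // of a temporary default value:
    // let prev = std::mem::take(&mut ctx.insides);
    // ctx.insides = prev.merge(batch_result);
}
```

The `std::mem::take` variant avoids changing `merge`'s signature, but it requires `Insides: Default` and briefly leaves a default value in place; the `&mut self` variant is the more direct fix.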
Thank you, that's exactly the hint I needed!
I've been working on implementing the MatMul benchmark, and have come across a bug where an action isn't scheduled even though it should be. The important reaction is here. When `data` is set on line 93, the connected reactions in `Worker` trigger, but the reaction to `next` seems to never trigger, even though it is scheduled at the end of the same reaction.
This problem disappears when I remove the parallel-runtime feature.
Here is the execution trace: