lf-lang / reactor-rs

Reactor runtime implementation in Rust
MIT License

Parallel runtime fails to schedule actions sometimes #4

Closed · jhaye closed this issue 2 years ago

jhaye commented 2 years ago

I've been working on implementing the MatMul benchmark and have come across a bug where an action isn't scheduled even though it should be. The important reaction is here. When data is set on line 93, the connected reactions in Worker trigger, but the reaction to next never triggers, even though it is scheduled at the end of the same reaction. The problem disappears when I remove the parallel-runtime feature.

Here is the execution trace:

[2021-12-17T11:35:38Z INFO ] Assembling runner
[2021-12-17T11:35:38Z INFO ] Assembling manager
[2021-12-17T11:35:38Z INFO ] Assembling workers
[2021-12-17T11:35:38Z INFO ] Registering workers
[2021-12-17T11:35:38Z INFO ] Registering manager
[2021-12-17T11:35:38Z INFO ] Registering runner
[2021-12-17T11:35:38Z INFO ] Triggering startup...
[2021-12-17T11:35:38Z TRACE]   - Level 1
[2021-12-17T11:35:38Z TRACE]   - Executing /runner/0
[2021-12-17T11:35:38Z TRACE]   - Executing /0
Benchmark: MatMulBenchmark
Arguments:
            numIterations = 12
               dataLength = 1024
[2021-12-17T11:35:38Z TRACE]   - Executing /manager/0
           blockThreshold = 16384
               priorities = 10
               numWorkers = 20
System information:
                 O/S Name = Linux
[2021-12-17T11:35:38Z TRACE] Pushing at (T0 + 0 ns = 0 ms, 1): run [2: {/runner/1}]
[2021-12-17T11:35:38Z TRACE] Processing event at (T0 + 0 ns = 0 ms, 1): run [2: {/runner/1}]
[2021-12-17T11:35:38Z TRACE]   - Running late by 136662519 ns = 136662 µs = 136 ms
[2021-12-17T11:35:38Z TRACE]   - Level 2
[2021-12-17T11:35:38Z TRACE]   - Executing /runner/1 (level 2)
[2021-12-17T11:35:38Z TRACE]   - Level 5
[2021-12-17T11:35:38Z TRACE]   - Executing /manager/1 (level 5)
[2021-12-17T11:35:38Z TRACE]   - Level 8
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[15]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[8]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[3]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[4]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[17]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[2]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[9]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[16]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[0]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[19]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[5]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[13]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[12]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[11]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[14]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[6]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[18]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[10]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[1]/0
[2021-12-17T11:35:38Z TRACE]   - Executing /workers[7]/0
[2021-12-17T11:35:38Z TRACE] Will wait for asynchronous event without timeout
[2021-12-17T11:35:38Z INFO ] Event queue is empty forever, shutting down.
[2021-12-17T11:35:38Z INFO ] Scheduler is shutting down, at (T0 + 137417726 ns = 137 ms, 0)
[2021-12-17T11:35:38Z INFO ] Scheduler has been shut down
jhaye commented 2 years ago

This is also a problem with the newly ported GuidedSearch benchmark. Again, things work fine without the parallel-runtime feature.

Here is the execution trace:

[2021-12-22T13:17:18Z INFO ] Assembling manager
[2021-12-22T13:17:18Z INFO ] Assembling runner
[2021-12-22T13:17:18Z INFO ] Assembling workers
[2021-12-22T13:17:18Z INFO ] Registering workers
[2021-12-22T13:17:18Z INFO ] Registering runner
[2021-12-22T13:17:18Z INFO ] Registering manager
[2021-12-22T13:17:18Z INFO ] Triggering startup...
[2021-12-22T13:17:18Z TRACE]   - Level 1
[2021-12-22T13:17:18Z TRACE]   - Executing /0
Benchmark: GuidedSearchBenchmark
[2021-12-22T13:17:18Z TRACE]   - Executing /manager/0
Arguments:
            numIterations = 12
                threshold = 1024
[2021-12-22T13:17:18Z TRACE]   - Executing /runner/0
                 gridSize = 30
               priorities = 30
               numWorkers = 20
System information:
                 O/S Name = Linux
[2021-12-22T13:17:18Z TRACE]   - Level 4
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[0]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[12]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[6]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[1]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[7]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[11]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[10]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[4]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[16]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[18]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[5]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[14]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[19]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[3]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[13]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[15]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[8]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[9]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[17]/0
[2021-12-22T13:17:18Z TRACE]   - Executing /workers[2]/0
[2021-12-22T13:17:18Z TRACE] Will wait for asynchronous event without timeout
[2021-12-22T13:17:18Z INFO ] Event queue is empty forever, shutting down.
[2021-12-22T13:17:18Z INFO ] Scheduler is shutting down, at (T0 + 1823966 ns = 1 ms, 0)
[2021-12-22T13:17:18Z INFO ] Scheduler has been shut down
jhaye commented 2 years ago

Same problem with the newly ported NQueens benchmark.

jhaye commented 2 years ago

I've investigated this a bit. The actions are genuinely never scheduled, as opposed to being scheduled but never executed.

After a tag is processed, all future events it caused are added to the event queue, so the fact that nothing ends up there suggests that the list of future events is empty at that point. We know these events aren't pushed, because the push_event macro would otherwise write a trace log.

Since the bug only occurs with the parallel-runtime feature enabled, the culprit is most likely somewhere in parallel_rt_impl::process_batch(…).

I wasn't able to dig further yet.
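
To make this concrete, here is a minimal, self-contained sketch of the flow described above. It is not the actual reactor-rs code, and every name in it (Event, TagContext, drain_into_queue) is illustrative: reactions record the events they schedule into a per-tag buffer, and only after the whole tag has been processed does the scheduler push that buffer onto the event queue. If the buffer is empty (or got clobbered earlier), nothing is pushed, no trace log appears, and the queue stays empty forever.

```rust
use std::collections::VecDeque;

/// Illustrative stand-in for a scheduled future event.
#[derive(Debug)]
struct Event {
    offset_ns: u64,
    reaction: &'static str,
}

/// Per-tag buffer that reactions write their scheduled events into
/// (a simplified stand-in for the scheduler's list of future events).
#[derive(Default)]
struct TagContext {
    future_events: Vec<Event>,
}

/// After a tag is fully processed, its future events are pushed onto
/// the global event queue. If `ctx.future_events` is empty at this
/// point, nothing is pushed and the scheduler eventually shuts down
/// with "Event queue is empty forever".
fn drain_into_queue(ctx: TagContext, queue: &mut VecDeque<Event>) {
    for ev in ctx.future_events {
        // The real runtime logs a "Pushing at ..." trace line here,
        // which is exactly the line missing from the traces above.
        println!("Pushing event for {} (+{} ns)", ev.reaction, ev.offset_ns);
        queue.push_back(ev);
    }
}

fn main() {
    let mut queue = VecDeque::new();

    let mut ctx = TagContext::default();
    // A reaction schedules an action at the end of its body:
    ctx.future_events.push(Event { offset_ns: 0, reaction: "next" });

    drain_into_queue(ctx, &mut queue);
    assert_eq!(queue.len(), 1);
}
```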

cmnrd commented 2 years ago

Thanks for taking a look! Sounds like you've got a good lead. From a quick glance, I would suspect that something in the unsafe code block isn't doing what it is supposed to do... but I don't understand it well enough either.

jhaye commented 2 years ago

From what I can tell, the unsafe block is fine. Right now I'm investigating the merge logic that runs after the parallel execution of the iterator. The code is very dense and sparsely documented, so I'll work on understanding and documenting it for now.
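
For context, the merge step in question follows the usual fold/reduce shape of a data-parallel iterator: each worker accumulates its own partial result, and those partial results are then merged pairwise. The sketch below is a generic, self-contained model of that shape using rayon, not the reactor-rs code itself; Partial and its fields are made up for illustration. The bug class to look for is a merge that drops one side instead of combining both.

```rust
// Assumes a rayon dependency (e.g. rayon = "1" in Cargo.toml).
use rayon::prelude::*;

/// Illustrative per-worker accumulator, standing in for whatever a
/// batch of reactions produces (e.g. newly scheduled events).
#[derive(Default, Debug)]
struct Partial {
    scheduled: Vec<u32>,
}

impl Partial {
    /// Correct merge: keeps the contents of both sides.
    fn merge(mut self, mut other: Partial) -> Partial {
        self.scheduled.append(&mut other.scheduled);
        self
    }
}

fn main() {
    let merged = (0u32..20)
        .into_par_iter()
        // Each worker folds its items into its own Partial ...
        .fold(Partial::default, |mut acc, i| {
            acc.scheduled.push(i);
            acc
        })
        // ... and the per-worker Partials are merged pairwise. A merge
        // that replaces one side with the other here would silently
        // drop scheduled events.
        .reduce(Partial::default, Partial::merge);

    assert_eq!(merged.scheduled.len(), 20);
}
```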

oowekyala commented 2 years ago

Hint: look at the following line:

https://github.com/lf-lang/reactor-rust/blob/dd7edda6f81ec1e1fc734b2d5b94993ee6ecb121/src/scheduler/scheduler_impl.rs#L493

From what I can tell, the insides are replaced outright here, even though they may still contain events from earlier batches processed at the current logical time.

I think this should logically be ctx.insides = RContextForwardableStuff::merge(ctx.insides, /* reduce expression result */);

but that requires changing the signature of merge, or adding a new function that merges in place through a mutable reference, since you can't move ctx.insides out of the ctx reference.
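
For illustration, here is a minimal sketch of both options, using a simplified stand-in type rather than the real RContextForwardableStuff; absorb_batch_result and the field names are made up. Option 1 keeps a consuming merge and uses std::mem::take to get ownership out of the mutable reference; option 2 adds an in-place merge on &mut self.

```rust
/// Simplified stand-in for RContextForwardableStuff: just the events
/// scheduled so far at the current logical time.
#[derive(Default)]
struct Forwardable {
    events: Vec<u32>,
}

impl Forwardable {
    /// Consuming merge, analogous to the existing merge signature.
    fn merge(mut self, mut other: Forwardable) -> Forwardable {
        self.events.append(&mut other.events);
        self
    }

    /// In-place merge through a mutable reference, which avoids having
    /// to move the field out of the context at all.
    #[allow(dead_code)]
    fn merge_in_place(&mut self, mut other: Forwardable) {
        self.events.append(&mut other.events);
    }
}

struct Ctx {
    insides: Forwardable,
}

fn absorb_batch_result(ctx: &mut Ctx, batch_result: Forwardable) {
    // What the linked line effectively does today: events from earlier
    // batches at the same logical time are thrown away.
    // ctx.insides = batch_result;

    // Option 1: keep the consuming merge, but take ownership first,
    // since ctx.insides can't be moved out of a &mut Ctx directly.
    let previous = std::mem::take(&mut ctx.insides);
    ctx.insides = previous.merge(batch_result);

    // Option 2 (alternative): merge in place instead.
    // ctx.insides.merge_in_place(batch_result);
}

fn main() {
    let mut ctx = Ctx { insides: Forwardable { events: vec![1, 2] } };
    absorb_batch_result(&mut ctx, Forwardable { events: vec![3] });
    assert_eq!(ctx.insides.events, vec![1, 2, 3]);
}
```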

jhaye commented 2 years ago

Thank you, that's exactly the hint I needed!