herd / herdtools7

The Herd toolsuite to deal with .cat memory models (version 7.xx)
Other
215 stars 54 forks source link

[litmus] Add support for speedcheck parameter for -mode presi #869

Closed relokin closed 1 month ago

relokin commented 2 months ago

This change adds support for the speedcheck parameter for -mode presi. This was already supported in -mode std. The user can provide the parameter "+sc" which will force the exit to as soon as the post-condition is observed.

maranget commented 2 months ago

Hi @relokin, using one global stop_now flag may lead to deadlock. Commit b72a2f38ed279fdc45e4cb0131d10bcc4af85a68 is an attempt to avoid such deadlocks by stopping all instances before stopping the experiment.

relokin commented 1 month ago

Hey @maranget, indeed I missed this. I tried to avoid other deadlocks, but I missed this.

I had a look at your patch and I think it's a much better way to achieve what I was trying to. Do you want to open a new PR with your own patch or should I cherry pick it in this pull request?

maranget commented 1 month ago

Hi @relokin, cherry picking looks like the most adequate technique. Your opinion?

relokin commented 1 month ago

Just to check with you that my understanding is correct.

The high level desire was to exit litmus7 as soon as there is an execution where the post-condition is satisfied. With this PR, if a specific execution satisfies the post-condition, then other instances will continue for a little longer.

For example, let's say we execute 2 instances of MP using 4 cores. The 1st instance observes an execution which satisfies the post-condition and exits immediately. The 2nd instance will continue executing until the end of the for loop (executing in total size iterations), or until it encounters itself an execution which satisfies the post-condition).

However, at the end of the execution, we know that at least for one instance its last execution satisfied the post-condition. And in the case of presi, we know that at least for at least a set of cores (2 in the case of MP), the last time they executed the test, it satisfied the post-condition.

Does that make sense?

maranget commented 1 month ago

The high level desire was to exit litmus7 as soon as there is an execution where the post-condition is satisfied. With this PR, if a specific execution satisfies the post-condition, then other instances will continue for a little longer.

Hi @relokin, we agree on the high-level desire. If I am not mistaken, this is what the synchronisation code of commit 0b45377 does. That is all instance will exit as soon as possible if one instance discovers that the post-condition is satisfied.

Every test thread executes nruns times (function choose) a sequence of size tests (function choose_params).

As soon as one of the instance discovers that the post-condition is satisfied, it sets the global flag stop_now to true. Moreover all the threads of any instance synchronise as follows: thread number zero copies the global flag into an instance level flag and all instance thread synchronise with an instance level synchronisation barrier before they read the instance level flag. If this flag is set, all threads exit the loop by returning from the choose_params function. As a consequence, all thread of all instances will exit their loops as soon as possible and return inside the loop of size nruns in the function choose. There they all synchronise on a global synchronisation barrier before reading the global flag and all exit if they see it set,

I am not sure the scheme above is dead-lock free. It looks important that all the threads of a given instance act consistently. Hence the idea of them synchronising before reading the instance level flag.

maranget commented 1 month ago

A simpler scheme that would not stop threads as soon as the previous one, but that would spare the instance level synchronisation, would be as follows: the choose_param function (loop on size) simply records the occurrence of the stop condition locally, returning 1 if the stop condition occurred for some of the loop iteration and 0 otherwise.

In choose, if the returned value is one, set the global stop_now flag to one. Then synchronise on a global sync barrier, before reading the global flag and exiting when set.

relokin commented 1 month ago

A simpler scheme that would not stop threads as soon as the previous one, but that would spare the instance level synchronisation, would be as follows: the choose_param function (loop on size) simply records the occurrence of the stop condition locally, returning 1 if the stop condition occurred for some of the loop iteration and 0 otherwise.

Thanks @maranget! The motivation for this change (and I am aware that this might not be the same for speedcheck in -mode std) was to make it easier to identify the execution that satisfied the post-condition. So it's quite important that we exit as soon as possible. So I would rather not change it, unless ofc, there is something wrong with the current approach.

maranget commented 1 month ago

Hi @relokin. I guess we agree to merge this PR as it is, or do you see additional improvements?

maranget commented 1 month ago

If we agree on merging, would you please rebase on master?

relokin commented 1 month ago

I am happy for this to be merged. Thanks @maranget!

maranget commented 1 month ago

Merged, thanks @relokin