viperproject / prusti-dev

A static verifier for Rust, based on the Viper verification infrastructure.
http://prusti.org

`verify/pass/arrays/{merge,selection_sort}.rs` timeout issue #819

Open Pointerbender opened 2 years ago

Pointerbender commented 2 years ago

On commit 7e259b8 from PR #817, the test case prusti-tests/tests/verify/pass/arrays/selection_sort.rs reliably fails with a timeout error during the CI run on Ubuntu. This timeout is not reproducible on the Windows or Mac CI runs, nor on my local Ubuntu machine, where the test case verifies fine. @vakaras pointed out that such a performance issue could very well be due to a matching loop.

There are a couple of things we could try to investigate this, per the instructions from @fpoli:

Typically such large performance variations are caused by matching loops. @fpoli If I am not mistaken you had used AxiomProfiler recently, would you mind adding some brief instructions on how to use it?

I didn't; my logs were too big (my timeout was 30 minutes) and the tool kept crashing on Linux. However, this failing test in Prusti seems much smaller than that. The readme has instructions on how to run the tool and there is a working Docker image that can be used even from Mac. To use the profiler, see here; they link to the paper that describes the tool. The "Help" menu of the tool also contains useful instructions.

Alternatively, Z3's smt.qi.profile=true flag is very promising. Its output is explained here, and with a bit of processing I used it to see how many times each quantifier is instantiated. You first need to obtain from Silicon the .smt2 file with the commands sent to Z3, making sure that you can reproduce the performance issue just by running Z3 on it.
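As a starting point, the per-quantifier counts can be tallied with a small script. This is a hypothetical sketch, not a tool from the repository: it assumes profile lines of the shape `[quantifier_instances] <qid> : <count> : ...` and that the reported counts are cumulative; the exact field layout varies across Z3 versions, so adjust the parsing to your output.

```rust
use std::collections::HashMap;

// Tally quantifier instantiations from Z3's `smt.qi.profile=true` output,
// assuming each report line looks like:
//   [quantifier_instances] <qid> : <count> : ...
fn count_instantiations(log: &str) -> HashMap<String, u64> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for line in log.lines() {
        if let Some(rest) = line.strip_prefix("[quantifier_instances]") {
            let mut fields = rest.split(':').map(str::trim);
            if let (Some(qid), Some(n)) = (fields.next(), fields.next()) {
                if let Ok(n) = n.parse::<u64>() {
                    // Assuming reports are cumulative, keep the highest
                    // count seen per quantifier id.
                    let entry = counts.entry(qid.to_string()).or_insert(0);
                    *entry = (*entry).max(n);
                }
            }
        }
    }
    counts
}

fn main() {
    // Made-up sample log; in practice, pipe in the output of
    // `z3 smt.qi.profile=true <file>.smt2`.
    let log = "\
[quantifier_instances] k!42 : 1500 : 3 : 4
[quantifier_instances] prog.l38 : 7 : 1 : 2
[quantifier_instances] k!42 : 3000 : 5 : 6
";
    let mut rows: Vec<(String, u64)> = count_instantiations(log).into_iter().collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1)); // most-instantiated first
    for (qid, n) in rows {
        println!("{n:>8}  {qid}");
    }
}
```

Quantifiers with counts that are orders of magnitude larger than the rest are the usual suspects for a matching loop.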

After tracking down and fixing the performance issue, we should re-enable the prusti-tests/tests/verify/pass/arrays/selection_sort.rs test case, which is ignored until this is resolved.

Pointerbender commented 2 years ago

I also observed selection_sort.rs failing spuriously in the wild on Mac OS X when fetching the latest changes on master from upstream to my fork of prusti-dev. This particular commit doesn't have the FxHash changes yet, so it's probably not reproducible there: https://github.com/Pointerbender/prusti-dev/runs/4737295553?check_suite_focus=true#step:8:707

fpoli commented 2 years ago

The same for pass/arrays/merge.rs on Mac: Prusti failed to finish within 500 seconds. It finished in 521.787225005s. (job)

fpoli commented 2 years ago

Where is the timeout set? Edit: here.

Pointerbender commented 2 years ago

I have a working theory of what might be happening: I think the fix_quantifiers and/or the optimize_folding passes might be altering the user-specified quantifiers in a way that makes it hard for Viper to infer the correct triggers (since the user-specified quantifiers in the test case don't have explicit triggers, Viper attempts to infer them automatically).

Take for example the post-condition on line 65:

https://github.com/viperproject/prusti-dev/blob/f7f3e1b27e678c28c078c1be0796ab974dfc7c62/prusti-tests/tests/verify/pass/arrays/selection_sort.rs#L60-L75

If I run the test in debug mode with the fix_quantifiers and/or the optimize_folding optimizations disabled, the whole test compilation finishes successfully in ~58 seconds, of which the Viper verification of method selection_sort takes less than 20 seconds (in release mode the entire compilation takes 22 seconds, of which selection_sort takes only 11-12 seconds). The post-condition #[ensures(forall(|k1: usize| (0 <= k1 && k1 < 10 && k1 != i && k1 != min) ==> (a[k1] == old(a[k1]))))] is translated into the following Viper code:

forall _2_quant_101: Int ::
    0 <= _2_quant_101 && _2_quant_101 <= 18446744073709551615
    ==>
      0 <= _2_quant_101 && (
        _2_quant_101 < 10 && (
          !(_2_quant_101 == old[l38](_89))
          && !(_2_quant_101 == old[l38](_90))
        )
      ) ==>
        _2_quant_101 < 10 && (
          _2_quant_101 < 10
          && read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(old[l38](_87.val_ref)), _2_quant_101)
          == old[l38](read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_87.val_ref), _2_quant_101))
        )

This is already somewhat tricky, because we don't know for sure (without profiling first) whether Viper will infer both of these triggers:

{ read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(old[l38](_87.val_ref)), _2_quant_101) }
{ read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_87.val_ref), _2_quant_101) }

If I manually add both triggers in the dumped Viper file, it verifies fast in Viper IDE in about 13 seconds on a cold start. If I only add the latter trigger then Viper IDE takes much longer to verify (it timed out after 300 seconds). So at least empirically it looks like Viper correctly infers both triggers in this case (profiling will have to tell us for sure).
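For reference, manually adding the two triggers to the dumped quantifier amounts to placing them right after the bound variable declaration, roughly (body abbreviated, same as above):

```viper
forall _2_quant_101: Int ::
    { read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(old[l38](_87.val_ref)), _2_quant_101) }
    { read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_87.val_ref), _2_quant_101) }
    0 <= _2_quant_101 && _2_quant_101 <= 18446744073709551615
    ==>
      // ... same body as above
```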

It gets a lot worse when we enable the two optimizations again: now the Viper verification of method selection_sort (through Prusti, which uses a fixed Z3 random seed) takes almost 120 seconds! The same post-condition now turns into the following Viper code:

inhale (let _LET_41 == (old[l38](_87.val_ref)) in (let _LET_40 == (old[l38](_90)) in (let _LET_39 == (old[l38](_89)) in (
forall _2_quant_101: Int ::
    0 <= _2_quant_101
    && _2_quant_101 <= 18446744073709551615
    ==>
      0 <= _2_quant_101
      && (
        _2_quant_101 < 10 && (
          !(_2_quant_101 == _LET_39)
          && !(_2_quant_101 == _LET_40)
        )
      ) ==>
        _2_quant_101 < 10 && (
          _2_quant_101 < 10
          && read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_LET_41), _2_quant_101)
          == old[l38](read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_87.val_ref), _2_quant_101))
        )
// .....

Which triggers could Viper possibly choose here, out of these possibilities:

//  #[ensures(forall(|k1: usize| (0 <= k1 && k1 < 10 && k1 != i && k1 != min) ==> (a[k1] == old(a[k1]))))] 
forall _2_quant_101: Int ::
    { read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_LET_41), _2_quant_101) }
    { read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(old[l38](_87.val_ref)), _2_quant_101) }
    { read$Snap$Array$10$i32$__$TY$__(snap$__$TY$__Snap$Array$10$i32$(_87.val_ref), _2_quant_101) }

Would the choice matter a lot for performance? Would this also matter in Viper assert and exhale statements, or only for inhale statements?

(I didn't do any profiling yet; I was just trying to eyeball it and then noticed the performance difference when disabling those optimizations.)

fpoli commented 2 years ago

It seems to me that the transformation from the first to the second quantifier that you reported is doing a lot of unnecessary stuff. I don't see why we would need to move those old expressions out of the quantifier. If I recall correctly those passes are a workaround for an incompleteness of Viper when the old expression is used to identify some acc(..) permission inside a quantifier (e.g. forall ... :: unfolding acc(P(old[..](..))) in ...). But when the type of the Viper expression is a value (Int or domain instance) we shouldn't be able to hit that incompleteness, so the transformation shouldn't be needed.

If triggers are the cause, I think the proper fix for that example would be to add the triggers {a[k1]}{old(a[k1])} on line 65, so that Prusti should generate the first two triggers that you wrote. However, this might still be blocked by https://github.com/viperproject/prusti-dev/issues/812.
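In the quantifier syntax used later in this thread, that annotation would look roughly like the following (a sketch only; as noted, #812 may still block it):

```rust
#[ensures(forall(|k1: usize|
    (0 <= k1 && k1 < 10 && k1 != i && k1 != min) ==> (a[k1] == old(a[k1])),
    triggers=[(a[k1],), (old(a[k1]),)]))]
```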

Which triggers could Viper possibly choose here, out of these possibilities:

It's always hard to answer this. I opened https://github.com/viperproject/silicon/issues/586.

Would the choice matter a lot for performance?

The first two should be equivalent, because the verifier should know _LET_41 == old[l38](_87.val_ref), but the last one is not equivalent to them. So the choice should still matter, but it depends a lot on what's in the rest of the program.

Would this also matter in Viper assert and exhale statements, or only for inhale statements?

Yes, because assert and exhale assume the property when the check succeeds.

vakaras commented 2 years ago

I think the fix_quantifiers and/or the optimize_folding passes might be altering the user-specified quantifiers in a way that makes it hard for Viper to infer the correct triggers

Just a heads-up: these passes should be gone once I finish my refactorings.

(since the user-specified quantifiers in the test case don't have explicit triggers, Viper will attempt to infer these automatically)

From our Viper experience, we concluded that the user should always write the triggers because inferring the correct ones is impossible. So, if it is clear what triggers to write, we should just add them to the example.

Pointerbender commented 2 years ago

In that case I think the best way forward to fix this performance issue is to add triggers to the test case once #812 is resolved and @vakaras' refactorings land. I think that should solve it and then we won't need to do the profiling.

In another comment I alluded to an idea about detecting potential matching loops automatically. I will post a more detailed description later, but at a very high level I'm thinking of two things:

vakaras commented 2 years ago

In that case I think the best way forward to fix this performance issue is to add triggers to the test case once #812 is resolved and @vakaras' refactorings land. I think that should solve it and then we won't need to do the profiling.

I completely agree.

In another comment I alluded to an idea about detecting potential matching loops automatically. I will post a more detailed description later, but at a very high level I'm thinking of two things:

  • Emit a compiler warning when a potential matching loop is detected (this would be a step before running the Viper program). This will also warn the user if the matching loop is not actually triggered.

  • Let Prusti infer the default triggers when the user omits these, so that it's always clear which triggers are used by looking at the Viper program (in lieu of viperproject/silicon#586). These triggers might not always be complete, but that is the price users pay when they choose to omit them :P If the auto-inferred trigger(s) could lead to a (potential) matching loop, this becomes a compiler error that tells the user to add another quantified variable and/or specify the triggers manually.

This would definitely improve the user experience. One more idea: when the user omits triggers, show a warning whose note states which triggers were inferred, so that the user has a starting point.

Pointerbender commented 2 years ago

With PR #876, selection_sort.rs was sped up a little further, but while playing around with quantifier triggers I discovered that code like this doesn't fully work yet:

while i < a.len() {
    // ... snipped code and loop body invariants from `selection_sort.rs`
    let a_i = a[i];
    let a_min = a[min];
    a[i] = a_min;
    a[min] = a_i;
}

But this does work currently:

while i < a.len() {
    // ... snipped code and loop body invariants from `selection_sort.rs`
    let a_i = a[i];
    let a_min = a[min];
    set(a, i, a_min);
    set(a, min, a_i);
}

#[requires(0 <= i && i < 10)]
#[ensures(forall(|j: usize| (0 <= j && j < 10 && j != old(i)) ==> (a[j] == old(a[j])), triggers=[(a[j],)]))]
#[ensures(a[old(i)] == old(v))]
fn set(a: &mut [i32; 10], i: usize, v: i32) {
    a[i] = v;
}

I suspect this is because, in the first example, the array assignments (which do not use the snap encoding) do not trigger the quantifiers from the loop body invariants (which do use the snap encoding). In the second example, the correct triggering assertions are contained in the post-conditions of set; these are snap-encoded and thus do trigger the quantifiers from the loop body invariant.

At the Viper level, this could probably be fixed by adding a call to snap$__$TY$__Snap$Array$10$i32$ after the last array assignment a[min] = a_i; at the end of the loop:

function snap$__$TY$__Snap$Array$10$i32$(self: Ref): Snap$Array$10$i32
  requires acc(Array$10$i32(self), read$())
  ensures (forall i: Int ::
         { read$Snap$Array$10$i32$__$TY$__(result, i) }
         { lookup_pure__$TY$__Array$10$i32$i32$(self, i) }
         0 <= i && i < 10 ==>
                  read$Snap$Array$10$i32$__$TY$__(result, i)
                  == lookup_pure__$TY$__Array$10$i32$i32$(self, i))
{
  cons$Snap$Array$10$i32$__$TY$__(seq_collect$Array$10$i32$__$TY$__(self, 0))
}

The function snap$__$TY$__Snap$Array$10$i32$ has the post-conditions needed to propagate the array assignment into the snap domain and thereby satisfy the loop body invariants. Even though the loop body invariants already make a call to snap$__$TY$__Snap$Array$10$i32$ at the Viper level, the problem seems to be that the array assignments are not preserved between loop iterations (at the snap encoding level, at least), so on entering the next iteration the verification fails on the first body invariant that depends on the array a.

I was pondering on what it would take to be able to write the first example and have Prusti verify it okay. Some possibilities are:

  1. Wait for #796 which addresses snap encoding of Rust assertions. Then we can add (debug_)assert! statements at the end of the loop which will invoke snap$__$TY$__Snap$Array$10$i32$. This one is currently blocked by the refactoring.
  2. We could update the quantifier triggers for array/slice assignments to also trigger on { read$Snap$Array$10$i32$__$TY$__(X, i) }; currently those only trigger on { lookup_pure__$TY$__Array$10$i32$i32$(X, i) }. This would be nice for the end-user experience, because fewer Rust assertions would be needed. However, it may decrease overall performance, because of the extra trigger and larger Viper programs in cases where the snap encoding is not needed for the verification.
  3. Find another way to preserve information about the array assignment between loop iterations. Johannes also suggested this option in his master's thesis, in section "6.1 Future Work".

Would there be any merit in starting to investigate one of these options before the refactoring lands? Are there any other interesting options not listed above? I would be interested in exploring this further :)

Thanks!

vakaras commented 2 years ago

Thank you for investigating this! I suspect that the second option would be quite easy to implement and hopefully would not introduce overly serious performance problems. So, I personally would vote for it.