Instability in test case, which disappears when removing unrelated program elements

bobismijnnaam commented 4 years ago

In the VerCors testsuite we have a test that's unstable (several actually, but I'm talking about this one). I've tried minimising the Silver output of VerCors to figure out what's causing the instability. I've gotten to the point where I can consistently get passes/fails for a 400-line minimised version of the original output. But, it seems that simplifying the file any further gets rid of the instability. I've tried removing unrelated/unused methods/axioms/fields (which I think should not matter for the difficulty of proving the unstable method call, but z3 can act strangely I guess), and even after 2500 runs I still can only get passes.

It seems the problem with the program is not so much any particular construct, but more the volume of methods+fields+axioms...? Unless I've overlooked it there is no non-linear arithmetic in there. There are a lot of foralls in there which might be related...?

I would like to do more debugging/minimising, but my standard approach of minimising does not get me further than this. I would be very grateful if someone from the viper team could have a quick look if there are any minimisation opportunities that I've missed/don't know about, or if someone could have a quick look if there is something weird taking place inside Viper/Z3.

The file can be found here: elect_min_viper_master.zip. I've marked the problematic methods/method calls with PROBLEMATIC comments. I can also supply my parallel runner script if someone else wants to do some testing.

(cc: @niomaster)

Example output of a detected instability:

-- Passes: 16, fails: 0, time-outs: 0 (cutoff at 100) --
-- Command: /home/bobe/UNSAFE/Tools/viper/silicon/silicon.sh --z3Exe /home/bobe/.nix-profile/bin/z3  elect_min_viper_master.sil --
-- Overview durations 2020-08-30 20:16:46.970859 --
pps[0]: 25.073548793792725s
pps[1]: 25.83439612388611s
pps[2]: 25.735045909881592s
pps[3]: 26.91899585723877s
pps[4]: 27.60429286956787s
pps[5]: 28.785851001739502s
pps[6]: 24.99570345878601s
pps[7]: 26.027161836624146s

One passed! Output of process:
Silicon 1.1-SNAPSHOT (d45da1d7+)

Silicon finished verification successfully in 36.25s.

Restarting pps[5]
One passed! Output of process:
Silicon 1.1-SNAPSHOT (d45da1d7+)

Silicon finished verification successfully in 38.03s.

Restarting pps[4]
One failed! Output of process:
Silicon 1.1-SNAPSHOT (d45da1d7+)

Silicon found 1 error in 37.98s:
  [0] The precondition of method method_Program_lemma_sigmaRecv_choice__Future__Integer__Integer__Integer__Integer__Integer might not hold. There might be insufficient permission to access hist_idle(diz.field_Program_f, q2, p_method_Future_SigmaRecv__Integer__Integer__Integer__Integer__Integer__Integer(rank, diz.field_Program_size, v, diz.field_Program_maxvalue - 1, diz.field_Program_maxvalue, n)). (elect_min_viper_master.sil@418.5)

Restarting pps[3]
Encountered 18 passes, 1 fails, and 0 time-outs
Search took 134s

My z3 version is 4.8.7. I have no changes made to my local checkout of silicon (the + is just there because of some temporary files in the silicon dir)

mschwerhoff commented 4 years ago

Debugging these kinds of performance issues is unfortunately very difficult. Given all the quantifiers, it could be that there is a matching loop (cycle in the instantiation graph), or that the instantiation graph is "just" very wide or deep. In either case, small changes in the SMT input, e.g. renamed variables or different order of conjuncts, can affect Z3's heuristics and cause it to explore possible instantiations in different orders, thus affecting performance. Different Z3 versions are also likely to explore different parts of the search space, given the same input file.

Such changes to the SMT output can be caused by Silicon's internal parallelisation; using --numberOfParallelVerifiers 1 will disable it. Please let me know if you still observe performance fluctuations afterwards — if so, maybe you could post the different SMT files here as well.

Back to debugging: assuming the problem is due to quantifier instantiations, then the best way of peeking inside Z3 is to use the Axiom Profiler. It isn't an easy task, though. Maybe @wytseoortwijn can help you with it, he has used the tool a couple of times in the last year, if I'm not mistaken.

bobismijnnaam commented 4 years ago

Thank you for the explanation.

Adding --numberOfParallelVerifiers 1 makes my minimised example stable (at least, it does not fail after 100 runs - usually this file would fail to verify after 20 runs on average). However, the complete elect.pvl file is still unstable even with the --numberOfParallelVerifiers flag.

When I use the flag, two smt files are produced (00 and 01). 00 is identical for both a passing and a failing run (is this the log of controlling Z3 from Viper's side?). 01-fail.smt and 01-succeeding.smt2 differ for about 3000 lines. I've included the files but I reckon they are too big to be useful (around 1.5MB for the 01.smt2 files): smtlogs.zip. I'll have a look to see how far I can minimise the unstable file in combination with the --numberOfParallelVerifiers flag.

mschwerhoff commented 4 years ago

However, the complete elect.pvl file is still unstable even with the --numberOfParallelVerifiers flag.

Could you please post that file's Viper program as well?

When I use the flag, two smt files are produced (00 and 01). 00 is identical for both a passing and a failing run (is this the log of controlling Z3 from Viper's side?). 01-fail.smt and 01-succeeding.smt2 differ for about 3000 lines.

Silicon uses N Z3 instances for verifying methods, where N is the number of parallel verifiers; plus, for technical reasons, one instance for verifying functions and predicates. The latter interaction is recorded in file 00-....smt2, the files 01-... to 0N... correspond to the method verification instances.

With the same input Viper program and one parallel verifier, Silicon should always emit the same SMT output. Nondeterminism could come from collections, in particular hash-based ones, though. If you could send me the translation of elect.pvl, I'll investigate once I've some time (ETH's semester is about to start).

However, please keep in mind that changes in Silicon can at most stop performance fluctuations for now, but if the root cause are too many quantifier instantiations, the problem will most likely resurface, e.g. when the Z3 changes, or when minor changes to elect.pvl are made.

bobismijnnaam commented 4 years ago

Thank you for investigating. The full file is here: elect_full.zip.

Yeah, we understand this file will probably remain problematic for a long time to come. The main piece of knowledge we are looking for is what exactly about this file is so problematic, since once we know this we can avoid it (we even started building some kind of live z3 log output interpreter so we could figure out what part of the file is causing problems, but we couldn't get it to work).

When you say the root cause is that there are too many quantifier instantiations, do you mean there are too many foralls, that the triggers are too lax/general (applied/triggered too often), or something else?

Of course, if you have any suggestions of changes we can apply to the silver/source file to make it stable again, those are also very welcome.

mschwerhoff commented 4 years ago

Thank you for the additional file.

When you say the root cause is that there are too many quantifier instantiations, do you mean there are too many foralls, that the triggers are too lax/general (applied/triggered too often), or something else?

The root cause may be too many quantifier instantiations, an analysis with the axiom profiler could provide evidence of that. The number of foralls in the program is, by itself, typically not the real problem, but its the way in which they interact due to — too numerous — triggering. This recent paper on the axiom profiler> explains possible scenarios, if I'm not mistaken.

The problem could also be a combination of surface foralls, i.e. those literally in the program, and background foralls, i.e. those emitted by Silicon, e.g. for Viper functions or quantified permissions.

Of course, if you have any suggestions of changes we can apply to the silver/source file to make it stable again, those are also very welcome.

If I spot anything, I'll let you know. Either way, I'll get back to you once I've found time to investigate.

viperproject / silicon

Instability in test case, which disappears when removing unrelated program elements #515