Closed: VictorCMiraldo closed this 1 year ago.
The last commit introduces a hack for something we don't fully understand: apparently, translation to SMT can fail, and we would like to know exactly why and what the implications are.
Added some debugging code in here; the plan for tomorrow is to use the same debugging code on `main` and see how differently it behaves. I think we're pretty close.
While debugging this morning, I quickly spotted the issue. On the `main` branch, we collect the following constraints for `cabal run spec -- -p /ohearn/ && /y == 11/`:

```
b/sub $s2 1 == 0
∧ $s2 /= 0
∧ x ↦ D $s2 $s3
∧ __result ↦ D $s0 $s1
Body: b/ite @Delta False (D $s2 42) (D $s2 $s3)
Assume: b/== $s1 42
Prove: b/== $s3 11
```
Note how we actively collect `b/sub $s2 1 == 0` and keep it! In our implementation of constraints, that is impossible to represent: we would gather it through `TermEq [term| b/sub $s2 1 |] [term| 0 |]`, but that triggers the unification algorithm, which says *nope*: those are not unifiable. Consequently, that branch would be automatically pruned and we would miss the counterexample, making the test run forever!
We need to record non-unification constraints in our `ConstraintSet` too and engineer it a little more carefully. A great plan for today would be to fix the bug and then write tests for that module!
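For concreteness, here is a minimal sketch of the idea; all names and shapes below are made up for illustration and are not the actual Pirouette types:

```haskell
module ConstraintSetSketch where

import qualified Data.Map as Map

-- Hypothetical stand-ins for Pirouette's terms and meta-variables.
type Meta = String
data Term = App String [Term] | MetaVar Meta | Lit Integer
  deriving (Eq, Show)

-- Besides the substitution built by unification, the set also keeps
-- equalities that unification cannot decide (such as b/sub $s2 1 == 0),
-- so the corresponding branch is not pruned prematurely.
data ConstraintSet = ConstraintSet
  { csAssignments :: Map.Map Meta Term  -- meta-variables solved by unification
  , csSymbolicEqs :: [(Term, Term)]     -- undecided equalities, sent to the SMT solver as-is
  } deriving (Show)

-- Learning an equality: solve it when one side is a meta-variable, report
-- definite clashes as Nothing, and record everything else symbolically
-- instead of discarding the branch.
conjunctEq :: Term -> Term -> ConstraintSet -> Maybe ConstraintSet
conjunctEq (MetaVar m) t cs = Just cs { csAssignments = Map.insert m t (csAssignments cs) }
conjunctEq t (MetaVar m) cs = Just cs { csAssignments = Map.insert m t (csAssignments cs) }
conjunctEq (Lit a) (Lit b) cs
  | a == b    = Just cs
  | otherwise = Nothing                 -- genuinely unsatisfiable
conjunctEq l r cs = Just cs { csSymbolicEqs = (l, r) : csSymbolicEqs cs }
```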
When I merge this into `qa/z3_inline-c_faster-eval_batches` and run the benchmarks, the running times of both Z3 and Pirouette seem to increase by ~2s (out of 20s for Z3 and 40s for Pirouette). But this might just be noise. More precisely, here's the kind of benchmark results I get (after the PR / before the PR):
After the PR:

```
Z3:           22.984s  count: 113600
command:      30.788s  count: 1546100
launchSolver:  0.412s  count: 100
main:         42.496s  count: 1
readSExpr:     0.026s  count: 56800
showsSExpr:    7.390s  count: 1659700
```

Before the PR:

```
Z3:           21.842s  count: 113600
command:      29.886s  count: 2355500
launchSolver:  0.432s  count: 100
main:         43.256s  count: 1
readSExpr:     0.028s  count: 56800
showsSExpr:    7.442s  count: 2469100
```
The time difference varies (here it's ~1s), but the pre-PR version is consistently faster than the post-PR one on the interaction with SMT solvers. I don't see how the post-PR version could be sending fewer commands to the solver but still spend more time inside it.
This might be horribly outdated, but since I was (superficially) involved in the considerations that led to the redesign of the constraints type, maybe this helps: The goal was precisely to weed out trivial problems before consulting the SMT solver, in order to reduce the overhead incurred by sending them. So, it makes intuitive sense to me that there are fewer problems the solver sees, but it has to think harder about them. Do you have any data on, say, the size of the problems? (Or is my remark completely beside the point?)
Something else I haven't mentioned here is that the `Z3` count is only incremented when a command with interesting results is sent, typically `check-sat` (whereas the `command` count is incremented for every command). So I would expect the `Z3` count to be smaller post-PR, since the solver should not be checking for satisfiability as often.
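To make the bookkeeping concrete, here is a hedged sketch of how two such counters could be maintained; this is not Pirouette's actual instrumentation, and the `check-sat` prefix test is just a crude stand-in for "command with interesting results":

```haskell
import Data.IORef
import Data.List (isPrefixOf)

-- Illustration only: every command sent bumps the "command" counter, while
-- only commands whose answer we wait for also bump the "Z3" counter.
data Counters = Counters
  { commandCount :: IORef Int
  , z3Count      :: IORef Int
  }

countCommand :: Counters -> String -> IO ()
countCommand cnt cmd = do
  modifyIORef' (commandCount cnt) (+ 1)
  if "(check-sat" `isPrefixOf` cmd
    then modifyIORef' (z3Count cnt) (+ 1)
    else pure ()
```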
> So, it makes intuitive sense to me that there are fewer problems the solver sees, but it has to think harder about them.

I agree with the fewer-problems part, but why would the solver have to think harder about them? Aren't they supposed to be the same problems?

> Do you have any data on, say, the size of the problems?

Apart from the number of commands sent, no. What kind of data would help?
Hmm, I think I can't really answer your questions accurately without knowing more about the benchmarks you actually run and without a dive into what the code in this PR actually implements. But I think that the problems the solver sees should be fewer, but larger; these are the two implications I expect from the changed `learn` function...
I logged the commands sent to the solver pre- and post-PR and ran `diff` on the results. The output is huge, but here are some interesting examples that I noticed while looking at the first lines (`<` is pre-PR and `>` is post-PR):
```
< (set-option :produce-models true)(declare-fun pir_greaterThan1 (Int) Bool)(declare-fun pir_greaterThan0 (Int) Bool)(declare-fun pir_add1 (Int) Int)(push 1)(declare-fun pir___result () Int)(declare-fun pir_x () Int)(check-sat)
---
> (set-option :produce-models true)(declare-fun pir_greaterThan1 (Int) Bool)(declare-fun pir_greaterThan0 (Int) Bool)(declare-fun pir_add1 (Int) Int)(push 1)(declare-fun pir___result () Int)(declare-fun pir_x () Int)(assert true)(check-sat)
```

and

```
< (pop 1)(push 1)(declare-fun pir___result () Bool)(declare-fun pir_s0 () Int)(declare-fun pir_x () pir_MaybeInt)(assert (= pir___result true))(assert (= pir_x ((as pir_JustInt pir_MaybeInt) pir_s0)))(check-sat)
---
> (pop 1)(push 1)(declare-fun pir___result () Bool)(declare-fun pir_s0 () Int)(declare-fun pir_x () pir_MaybeInt)(assert (and (= pir___result true) (= pir_x ((as pir_JustInt pir_MaybeInt) pir_s0))))(check-sat)
```
How I interpret this is that this PR is doing at least two things:

1. when Pirouette detects that a problem is trivial, it still sends it to the solver, but as an explicit trivial assertion (the `assert true` in the first example);
2. several assertions are merged into a single conjunction (the second example).

The first point explains why the solver is sent as many problems pre- and post-PR, and the second point explains why there are still fewer commands sent to the solver overall (I assume the number of times Pirouette detects that a problem is trivial is negligible compared to the number of assertions that can be merged). This still doesn't explain why the solver is spending more time on solving the problems post-PR, though: if anything, it should be spending less time because some problems now come directly with their solutions.
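For reference, here is a sketch of how I picture points 1 and 2 (my interpretation, not the PR's actual code), assuming the constraints have already been translated to SimpleSMT `SExpr`s:

```haskell
import qualified SimpleSMT as SMT

-- An empty constraint set becomes "true" (hence the "(assert true)" in the
-- post-PR log), and several constraints become one conjunction instead of
-- one assertion each.
constraintsToAssertion :: [SMT.SExpr] -> SMT.SExpr
constraintsToAssertion []  = SMT.bool True
constraintsToAssertion [c] = c
constraintsToAssertion cs  = SMT.andMany cs
```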
I ran Z3 on the logfiles of the commands produced by Pirouette pre- and post-PR, and I do get a noticeable difference in the running times, although the files should be logically equivalent:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `z3 isunity.pre >/dev/null` | 190.3 ± 12.6 | 176.4 | 231.9 | 1.00 |
| `z3 isunity.post >/dev/null` | 230.2 ± 3.0 | 223.6 | 241.0 | 1.21 ± 0.08 |
Looking at the diff between both files more closely, it really seems the only differences are that, post-PR, 1) a few trivial and negligible assertions are added, 2) some redundant declarations and assertions are removed, and 3) assertions are merged together using conjunctions. This is to be taken with a grain of salt, of course, because there are too many lines for someone to go through every one of them.
My guess right now is that Z3 spends more time parsing one big assertion with conjunctions than it does parsing many small assertions consisting of simple (in)equalities. @facundominguez, you had a look at the internals of Z3's parser; does that sound plausible to you?
I have no intuition on the cost of parsing larger expressions. But it could be checked by splitting the conjunctions into separate assertions before feeding them to the SMT solver.
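For instance, a minimal sketch of such a split, assuming the assertions are available as SimpleSMT `SExpr`s (an assumption on my part):

```haskell
import qualified SimpleSMT as SMT

-- Flatten a top-level conjunction into its conjuncts, so that
-- (assert (and a b ...)) can be re-sent as separate assertions.
splitConjuncts :: SMT.SExpr -> [SMT.SExpr]
splitConjuncts (SMT.List (SMT.Atom "and" : args)) = concatMap splitConjuncts args
splitConjuncts e = [e]
```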
Ah, and there is a function that only does the parsing, `Z3_parse_smtlib2_string`. You might be able to measure whether it worsens with the large conjunctions. I modified the `oneshot.c` program to invoke it at some point:
```c
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <z3.h>

void my_parser_example(char const* const fileName) {
    Z3_config cfg;
    Z3_context ctx;

    cfg = Z3_mk_config();
    Z3_set_param_value(cfg, "model", "true");
    ctx = Z3_mk_context(cfg);
    Z3_del_config(cfg);

    /* Read the whole SMT-LIB file into a buffer. */
    FILE* file = fopen(fileName, "r");
    char* buffer = NULL;
    size_t len;
    ssize_t bytes_read = getdelim(&buffer, &len, '\0', file);

    /* Parse only, without solving. */
    Z3_ast_vector b =
        Z3_parse_smtlib2_string(ctx,
                                buffer,
                                0,
                                NULL,
                                NULL,
                                0,
                                NULL,
                                NULL);

    /* Also evaluate the whole command sequence and print its output. */
    printf("%s", Z3_eval_smtlib2_string(ctx, buffer));
    fclose(file);
}

int main(int argc, char* argv[]) {
    char const* const fileName = argv[1];
    my_parser_example(fileName);
    return 0;
}
```
This doesn't seem to be the explanation: I merged the assertions in the logfile produced by Pirouette pre-PR, and I got:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `z3 isunity.pre >/dev/null` | 200.6 ± 13.5 | 182.1 | 226.7 | 1.00 |
| `z3 isunity_merge.pre >/dev/null` | 203.2 ± 3.6 | 197.2 | 227.0 | 1.01 ± 0.07 |
Are those benchmarks still up-to-date with the merge of #150? I would expect so because #150 is not supposed to change the generated constraints, but who knows.
> Are those benchmarks still up-to-date with the merge of #150?

Yep, I ran them again and I'm not seeing any difference.
FYI, if you want to run these benchmarks yourself, you just have to do `DEBUG_TIMESTATS_ENABLE=1 cabal run example` on `qa/z3_inline-c_faster-eval_batches` and `qa/z3_inline-c_faster-eval_batches_contstraints` (the latter measuring the performance of this PR). Make sure to run `direnv reload` first, as these branches use a pinned version of Z3.
With `DEBUG_TIMESTATS_ENABLE=1 cabal run example`, I can confirm @qaristote's observations: this PR makes things slightly slower (in particular, 10% more time spent in Z3). However, the overall run is barely affected (1-2%). I suggest we merge this PR as it cleans up a lot of code for a very minor cost.
This PR changes the simple `Constraints` datatype into `ConstraintSet`, adding a little more structure. Namely, it adds a `conjunct :: Constraint lang meta -> ConstraintSet lang meta -> Maybe (ConstraintSet lang meta)`, which returns `Nothing` whenever the resulting `ConstraintSet` would be trivially UNSAT.

edit: This PR is ready for review. Someone should really tackle the implementation of a proper union sort, as suggested here (and write good tests for that!); it might be worth doing that as a separate issue, after merging.
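A hedged sketch of the intended usage pattern: only the shape of `conjunct` comes from this PR; everything else below is made up for illustration.

```haskell
import Control.Monad (foldM)

-- Folding conjunct over the constraints gathered along a symbolic branch
-- short-circuits to Nothing as soon as the set becomes trivially UNSAT,
-- so that branch never reaches the SMT solver.
learnAll
  :: (constraint -> set -> Maybe set)  -- conjunct from this PR
  -> set                               -- starting constraint set
  -> [constraint]                      -- constraints learned along the branch
  -> Maybe set                         -- Nothing: prune the branch
learnAll conjunct = foldM (flip conjunct)
```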
Unrelated important change: this change in the order in which terms are executed by the `worker` function is very important! We should always first try to expand the assumption term, which might prune all sorts of uninteresting branches.