pschanely / CrossHair

An analysis tool for Python that blurs the line between testing and type systems.
Other
1.04k stars 49 forks source link

Re-investigate string and sequence solvers #189

Open pschanely opened 2 years ago

pschanely commented 2 years ago

Noted in #187:

Hello, I was digging into this trying to isolate the problem that part of the problem might be how the 'in' operator is mapped to z3.

Crosshair watch returns something on this function after 55 seconds, meaning (if I understand correctly this part) it takes that time to generate an example given the pre-conditions. Because post-conditions are never met it should fail in the first example. This happens also with dictionaries and Sets.

def keys_in_an_array(arr: List[str]) -> bool:
    """
    pre: "a" in arr
    pre: "b" in arr
    pre: "c" in arr
    pre: "d" in arr
    pre: "e" in arr
    post: __return__ == True
    """
    return False if "a" in arr else True

But solving the same constraints in z3 takes 0.1s

In [15]: import z3
    ...: import time
    ...: start_time = time.time()
    ...: s = z3.Solver()
    ...: sseq = z3.Const('sseq', z3.SeqSort(z3.StringSort()))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("a"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("b"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("c"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("d"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("e"))))
    ...: s.add(z3.Contains(sseq, z3.Unit(z3.StringVal("f"))))
    ...: s.check()
    ...: print(s.model())
    ...: print("--- %s seconds ---" % (time.time() - start_time))
[sseq = Concat(Unit("c"),
               Concat(Unit(""),
                      Concat(Unit("f"),
                             Concat(Unit(""),
                                    Concat(Unit("d"),
                                        Concat(Unit(""),
                                        Concat(Unit("e"),
                                        Concat(Unit(""),
                                        Concat(Unit("a"),
                                        Concat(Unit(""),
                                        Unit("b")))))))))))]
--- 0.01843857765197754 seconds ---

Sorry if I misunderstood something, I am still trying to get my head around the internals of it.

Originally posted by @petrusboniatus in https://github.com/pschanely/CrossHair/discussions/187#discussioncomment-4189852

petrusboniatus commented 2 years ago

I was investigating more on this, because efficient "in" operations turned out to be unavoidable for the data frame issue.

  1. Sequence of Sequences, for example a Sequence of strings, are very unstable, I get unknowns in z3 if:
    • I put a length in a constraint
    • I put distinct in a constraint
    • Even scaling up the example I created above. So they are discarded for now.
  2. If all "contains" operation are compressed in one you get decent performance on Sequence of Ints, This is very usefull to model sets because order does not matter. This is an example of a set of ints with more than 150 elements an 100 in constraints, solved in less than 2 seconds

    s = z3.Solver()
    l = z3.Const("int_seq", z3.SeqSort(z3.IntSort()))
    sub = z3. Empty(z3.SeqSort(z3.IntSort()))
    for i in range(100):
       sub = z3.Concat(sub, z3.Unit(z3.IntVal(i)))
    
    s.add(z3.Contains(l, sub))
    s.add(z3.Distinct(l))
    s.add(z3.Length(l) > 150)
    s.check()
    print(s.model())
  3. To maintain the performance and have a Set List of Strings for example can be achieved by mapping the integer values through a z3 "array":

    import IPython; IPython.embed()
    s = z3.Solver()
    l = z3.Const("string_seq", z3.SeqSort(z3.IntSort()))
    idx = z3.Const("map", z3.ArraySort(z3.IntSort(), z3.StringSort()))
    
    sub = z3.Empty(z3.SeqSort(z3.IntSort()))
    for i in range(100):
      sub = z3.Concat(sub, z3.Unit(z3.IntVal(i)))
      s.add(idx[i] == f"{i}")
    
    s.add(z3.Length(l) > 150)
    s.add(idx[l[120]] == "potato")
    s.check()
    print(s.model())

My proposal is to use the principle in 3th to model Sets and Lists in crosshair. I still will try to make an implementation of a crosshair proxy of at least Set[Int] to check that it translates as well to actual performance.

Edit: changed code messed up by the spell checker, and corrected 1 error.

pschanely commented 2 years ago

Interesting stuff! I've got a bunch of things to talk about here. I'm travelling right at the moment, but will circle back in a day or two!

pschanely commented 2 years ago

CrossHair's current implementation makes surprising choices for representation in Z3. In most cases, I started with the "obvious" choice and later ended up moving towards something dumber (trading more execution paths for less work during solving).

To keep maintenance low and quality high, I've largely tried to keep only one kind of implementation for each Python type. But in the ideal world, CrossHair probably would try different modelings for the same type in parallel. I very much want this kind of "representation forking" for floats, as it happens. At any rate, it would be great to see whether some of your work can stand alongside the existing implementations. (and we could also easily run some evaluations - hopefully over time, the SMT solvers get better and we can ditch some of CrossHair's simpler representations)

Okay, let's get into some of the details. Starting simple, let's just talk about regular old strings. I'll try to follow up with additional mini-writeups about sets and lists here soon.

string

Current modeling (source): A variable Z3 integer length, and a lazily created/expanded list of Z3 integers, one per codepoint. Obvious question: why not use a Z3 string?

Now, there are lots of reasons to prefer a native SMT string. A variety of simplistic string scenarios could be handled much more effectively (e.g. just find a string in another string). And handling for complex cases may now be better than I recall (historically, it seemed that the solving can easily get slow; nesting str.substr calls was problematic at one point).

petrusboniatus commented 1 year ago

Hello, I will be a little busy this week. I still plan to have the Set[I] by the end of the week.

Just wanted to comment maybe a maybe stupid idea that is to have extra Interfaces for a subset of the builting api that we know that works well. One example might be a dictionary without Iterate or length that more closely resemble an array in z3. Jax follows a similar 'same api with restrictions' approach and I think it works good. This might come very handy for the user to build his symbolic representation / type system for his classes that are too problematic to be directly mapped by z3.

pschanely commented 1 year ago

Just wanted to comment maybe a maybe stupid idea that is to have extra Interfaces for a subset of the builting api that we know that works well. One example might be a dictionary without Iterate or length that more closely resemble an array in z3. Jax follows a similar 'same api with restrictions' approach and I think it works good. This might come very handy for the user to build his symbolic representation / type system for his classes that are too problematic to be directly mapped by z3.

That doesn't sound stupid to me!

petrusboniatus commented 1 year ago

Hi again. I have an implementation of the Set[int] here only trying in for FrozenSets. I've not tested the actual performance yet (I am more confident with the List sequence), but passes the pre-existing unit tests.

In the meantime I have a question: I had to do a little modification to the core to be able to register a proxy for specific type with generics (FrozenSet[int]), as it was the fastest way to start testing this, not sure if it is the proper way because it feels like this dictionary is tracking origin type for some specific reason.

pschanely commented 1 year ago

Oh wow, you got quite far on your own. Let me answer your immediate question, and then will try to find a little time to review and say more things later this week.

We look up symbolic factories using the origin of the type rather than the actual type (see here). There isn't an important reason for this, except for that's USUALLY what we want. (one factory for all types of sets, lists, etc) If you want to keep that logic as-is, you can just add your set implementation under new branches in make_set(), which is probably what I'd advise for now.