pschanely / CrossHair

An analysis tool for Python that blurs the line between testing and type systems.
Other
1.04k stars 49 forks source link

Obtain symbolic values and path conditions without concretization after `path_cover` for post-testing analysis #142

Open hc825b opened 3 years ago

hc825b commented 3 years ago

Is your feature request related to a problem? Please describe. I would like to obtain symbolic values and the path condition for each generated test during or after path_cover explores a function. Ideally, the symbolic values and the path condition are not simplified with concrete values but retains the full SMT formula (with complex functions as uninterpreted, etc.) for a particular path.

My particular application is to automatically generate a symbolic/logical representation in SMT for simple Python functions that are loop-free, recursion-free, without side-effects, but using some tuples and lists. I think the information I want can be used in other applications such as test reduction, e.g., checking the path conditions are syntactically equivalent for different tests, or inferring post-conditions for simple functions.

Describe the solution you'd like I assume the symbolic values and the path condition are computed during the path_cover process, but might be discarded right after solving the SMT instances. I would like to know how to save and use them in post-testing.

Describe alternatives you've considered Another way is that, assuming I already have a set of tests to exhaust all paths of a function (derived manually or by path_cover), how can I generate the symbolic values and path conditions for each test using crosshair?

I would want to work on implementing this if you can point me to right places to start. My exploration tells me it has to do with how the StateSpace class maintains information of both SMT solvers and PathNodes, but I find it difficult to progress any further.

PS. This is such an awesome project with simple yet very practical ideas. Thanks! Chiao Hsieh

pschanely commented 3 years ago

Thanks! This is a really interesting idea.

As a first step, perhaps experiment with a few examples; you're certainly on the right track with StateSpace. It should be easy to dump the solver state by adding a line like print(state.solver.sexpr()) before or after this call to detach_path.

But! Something to worry about:

A few Python types have obvious representations in Z3, e.g. int. But in general, the Z3 model requires some sort of interpretation to be meaningful. As it evolves, I wouldn't want CrossHair to be tied to particular modeling - that's probably my biggest concern with exposing the Z3 model. There are even cases where CrossHair might attempt different Z3 representations for the same Python type. (as you might imagine, modeling decisions can significantly impact Z3's ability to solve the problem)

Z3 expressions can be introspected; in theory, we could work on tactics for transforming Z3 expressions back into Python expressions. That's probably the approach I'd recommend to develop this into a fully-fledged feature. (and I'd be happy to work with you on it). There are many other practical uses for this. For example, when debugging a counterexample, it's often hard to see what parts of the input are important to the bug, and which parts are irrelevant. A set of "path condition" expressions could potentially be much more useful. Another area that's interesting here is specialization/optimization - for simple-ish functions with finite paths, we might be able to re-generate the whole function via the logical representation. Anyways ... all pretty speculative ideas, but fascinating!

hc825b commented 3 years ago

Thanks for the reply! And thanks for the pointers!

I can see the complication due to different possible encodings into Z3 supported SMT formulas, and I agree that is the key to make the SMT query efficient and tractable in my experience. In this case, how about the alternative? Assuming I already have a set of test inputs, and I only want to collect the path condition expressions and symbolic values for each test input. The back-end for symbolic execution does not have to be z3 or SMT solvers because there is no need to find satisfiable values in branches. I can choose to use a more expressive logic formula tool e.g. SymPy, or even bolder use a subset of Python expressions (e.g., only allow immutable objects as variables and function/method calls without side effect). Ideally, this can ensure one unified encoding for a give test input and its corresponding execution path, and there is no need to translate Z3 formula back to Python expression.

This is probably no longer a feature request for crosshair, but using the same infrastructure to build another "symbolic semantics" for interpreting Python statements.

pschanely commented 2 years ago

Interesting. Indeed, this feels fairly distant from CrossHair's current implementation, as we don't really collect path conditions in any form other than SMT. There probably is space for the definition of some sort of intermediate language, as you suggest ("logical Python"?), but I haven't put much thought into it!

pschanely commented 2 years ago

See also #165 for another use case that might benefit from work in this direction.