mthom / scryer-prolog

A modern Prolog implementation written mostly in Rust.
BSD 3-Clause "New" or "Revised" License
2.02k stars 117 forks source link

More systematic testing of compilation #2181

Open triska opened 10 months ago

triska commented 10 months ago

2176, i.e., the system saying false even though there are solutions is a recent example of the worst kind of incorrectness, it's much worse than yielding wrong answer substitutions. That's because as long as we get something as a result, we can always test in other ways (for example: manually) whether it's really a solution. But if we get nothing, i.e., false, then we don't know: Is it because there truly is no solution, or is it because the system accidentally lost track of them?

This raises the urgent question: Is there anything we can do to find such issues more systematically?

In the concrete case of #2176, I had a program which I knew describes solutions because I found such solutions with earlier Scryer versions, and only therefore I noticed the incorrectness. However, are there other cases where we currently simply lose solutions? It seems that in order to find examples of such wrong compilations, we must generate programs which we know describe solutions.

It would also be nice to find a smallest fragment of the sample program that exhibits #2176, and add it to the test suite.

@haijinSk and @notoria, since you showed particular talent to construct such simpler cases, I hope this could be an interesting challenge also for you in particular. I would greatly appreciate your help with this!

triska commented 10 months ago

how(why) to continue in the issue. Even my Scryer is now only the new version, without the issue (I use: Code > Download ZIP). How I can make it meaningful/strong again. What can be systematic and useful now.

Yes, very good points, the concrete issue is indeed now resolved. Still, there are many useful contributions remaining:

  1. First, to find out which Scryer versions are affected by the issue. I suspect the issue was introduced with 0e583d620ab4ad95e55371905482c65dc5431bc9 and has since then (i.e., for more than 5 months) affected all Scryer versions since then. Is that the case? One way to test this systematically is to automatically build a Scryer executable for each commit, and to run the test case with each of these versions. This may seem pointless for this concrete issue, but the general approach is eminently sensible, and the point of this would be to set up an environment where we can quickly test each commit of Scryer, with its own separate executable. This will help is tremendously in the future, to quickly find the commit where an issue first appeared.
  2. Second, to add a test case to Scryer to make sure that #2176 is never re-introduced. I would favour a smallest fragment that exhibits the issue. The essence of the issue is that a query we expect to succeed fails. To find such a smallest fragment, we can use library(debug) and its operator (*)/1 to systematically generalize the program and find a smallest program fragment that still fails even though it is even more general. All goal that are "generalized away" can be textually removed from the program, and the remainder is a test case we can add to the build process to prevent this mistake from reoccurring. That in itself would also be a tremendous contribution. To work on this, you need to use one of the Scryer versions that exhibit the issue. You can use one of the executables you detect as affected in the step outlined above.

Automating step 2 is an example of using Prolog itself to make Scryer a great Prolog: In order to find a smaller fragment that exhibits unexpected failure, you can write a Prolog program that takes the program, generalizes it, and uses Prolog itself to test whether the generalization still fails.

One important point: The method I describe in (2) works only for programs in the pure monotonic core of Prolog. Only there does it hold that removing a goal yields a more general program, and only there can we find a smallest program that exhibits the issue in such a simple and systematic way, by arbitrarily inserting (*)/1 in front of goals, and even automate this way to construct a smallest fragment. Increasing the pure core of Prolog is therefore an important goal in the design of modern Prolog constructs.

UWN commented 10 months ago

It seems that in order to find examples of such wrong compilations, we must generate programs which we know describe solutions.

But in order to generate such programs, #1476

haijinSk commented 10 months ago

I suspect the issue was introduced with https://github.com/mthom/scryer-prolog/commit/0e583d620ab4ad95e55371905482c65dc5431bc9 and has since then (i.e., for more than 5 months) affected all Scryer versions since then. Is that the case?

The release version v0.9.3 (ZIP with (pre)compiled Scryer, 126d7bb-modified) seems to work with the Golfer Problem code, maybe even slightly better in regard to speed.

triska commented 10 months ago

That's already an excellent find, thank you a lot! So the mistake was introduced in one of the commits since the release, or only intermittently overshadowed by other changes!

notoria commented 10 months ago

Some small sample from good (1fbc7f984227d9019cf8d07a4e4851b23315b5eb) and bad (73ed08a412ed00428c3e9c1638a8033619c8051d):

% This does not reproduce #2176.
a :-
    (   (   c0(V)
        ;   integer(V),
            true
        )
    ;   integer(V),
        c1(V)
    ),
    c(V).

b :-
    (   (   (   c0(V)
            ;   integer(V),
                true
            )
        ;   integer(V),
            c1(V)
        ),
        c(V)
    ;   d0(W),
        (   d(W)
        ;   true
        )
    ).

In a/0 different instructions are generated. In b/0 different instructions are generated and less permanent variables are allocated in bad version. The presence of integer/1 changes how the code is generated. These predicates are extracted/generated from clpz_geq_/2.

This could help find the spot where the issue is happening.

haijinSk commented 10 months ago

Now, I don't have time to experiment/minimalize further. But here is my new observation/example:

The code to consult/test:

:- use_module(library(clpz)).

test([],_).
test([P|Ps], Q0) :-
        P #= Q,
        Q0 #< Q,
        write(i_am_here),nl, 
        test(Ps, Q).
The bad version output:

?- test([2000,3000,5000],10).
i_am_here
   false. % unexpected

The good version output:

?- test([2000,3000,5000],10).
i_am_here
i_am_here
i_am_here
   true. % as expected
notoria commented 10 months ago

Continuing with @haijinSk's query.

:- use_module(library(format)).
:- use_module(library(diag)).
% :- use_module(library(clpz)).

/*
:- dynamic(test/2).

test([],_).
test([P|Ps], Q0) :-
        P #= Q,
        Q0 #< Q,
        write(i_am_here),nl, 
        test(Ps, Q).
%*/

% ?- clause(test(Es,Q0), Body), portray_clause(test(Es,Q0), Body), false.
test([],A) :-
   true.
test([A|B],C) :-
    writeq(C), nl, % Inspect
   (  integer(A) ->
      (  true,
         var(D) ->
         D is A
      ;  integer(D) ->
         A=:=D
      ;  E is A,
         clpz:clpz_equal(E,D)
      )
   ;  integer(D) ->
      (  true,
         var(A) ->
         A is D
      ;  F is D,
         clpz:clpz_equal(A,F)
      )
   ;  clpz:clpz_equal(A,D)
   ),
   (  integer(D) ->
      (  integer(C) ->
        writeq([D,C]), nl, % Inspect
         D>=C+1
      ;  G is D,
         clpz:clpz_geq(G,C+1)
      )
   ;  integer(C) ->
      H is C+1,
      clpz:clpz_geq(D,H)
   ;  clpz:clpz_geq(D,C+1)
   ),
   write(i_am_here), nl,
   test(B,D).

query :-
    test([2000,3000,5000],10),
    false.

code :-
    PI = test/2,
    wam_instructions(PI, Is),
    lists:maplist(format:portray_clause, Is).

This is like a/0 the wrong instruction is generated. After expansion, the variable can be inspected and C is a bad reference to the stack on the second invocation of test/2. In the new stack C points to D thus the query fails.

haijinSk commented 10 months ago

In the good Scryer version in my test example the order of variables in the line P #= Q does not matter: P #= Q or Q #= P, both work. In bad/problematic version https://github.com/mthom/scryer-prolog/commit/73ed08a412ed00428c3e9c1638a8033619c8051d, if I change the order from P #= Q to Q #= P, then my example will work.

haijinSk commented 10 months ago

We can run the Golfer Problem code in the bad version successfully, if we change P//S #= Q to Q #= P//S in ascending_quotient/2 and ascending_quotient/3.

triska commented 10 months ago

These are all terrific insights, thank you a lot! To be absolutely honest, I would have been glad to see such insights appearing over the coming weeks or months. To see them posted here within hours of posting this issue is way more than I ever expected when I filed this issue, please keep up the great work!

I encourage everybody to further experiment with these examples and the different Scryer versions, with the final goal to distill the issue to a shortest fragment that shows the essence of the problem, and eventually to write Prolog programs that systematically generate arbitrarily many such examples.

A true test case of Scryer will run forever, continuously trying—and, if Scryer is implemented correctly, failing—to construct Prolog programs that are compiled incorrectly in this way.

haijinSk commented 10 months ago

There is one thing, perhaps irrelevant, so I apologize in advance. A mix of feelings and thinking of a Prolog beginner. Yes, you can imagine my own mental model of Prolog as a black magic box with a label: Here Be Dragons! And every head of a multi-head rule makes a whole new dragon for me for free. But I like dragons (I think they call it nondeterminism) more than hard thinking, I suspect. And, for example, I don't dream in big O notation, I can trust natural recursion like a child, etc. Call me a naive user of Prolog, I am.

Spurious correlations.

A different implementation, Trealla Prolog (v2.30.30), can pass my "i_am_here" test/example above successfully, yet will yield unexpected false with the Golfer Problem code #2176. And changing the order of variables in the P//S #= Q line is not helpful.

But even in context of one programming language implementation details are ever-changing with every "git-commit". I see it like this: more abstract, more general, more simple testing code, less helpful across new, future versions. Here I see dragons in a bad way.

In my imagination I can test every procedure against expected result, I think they call it unit testing, I can use pure monotonic core of Prolog to write my Prolog code and if it's not working as I expected, then I can use methods of logical debugging. But implementation details (and their bugs) are extra-logical to my Prolog code.

I hope my words are only implicit, half-baked, irrational and irrelevant question of a beginner.

Thank you for your patience.

triska commented 10 months ago

To yield correct results, library(clpz) requires a correctly working Prolog system. Taking the example of Trealla Prolog, it is known that it currently does not support library(clpz), we see several related issues that are already filed in its repository. Even the much simpler and shorter library(clpb) is not yet supported by Trealla, in the sense that several mistakes are already known. In the case of the known mistakes that arise when using these libraries with Trealla Prolog, the mistakes are not in the Prolog libraries, but in the underlying Prolog system. The interest to get a working library(clpb) and library(clpz) is present also in Trealla though, and several significant developments have already occurred to enable this support!

Such mistakes are normal, and more are found and corrected as the interest in the Prolog systems and libraries grows. We can compare these developments with the training of an organic muscle: If a muscle is not used, it atrophies. If a muscle is used, it gets stronger over time. In the same way, as more libraries use the underlying Prolog systems in more and more different ways, and application programmers use these libraries in more and more ways, the libraries and the underlying Prolog systems get stronger and more correct.

It is easy to build a wrong mental model of the internal machinery of a library, and by extension of a Prolog system. For example, with Trealla Prolog, we currently incorrectly get:

?- X #= Y + Z.
   false, unexpected.

Superficially, this may seem really easy to debug: After all, that's a single predicate, used in a very general way. How hard can it be to find and correct such an elementary mistake? Well, it turns out that this can be very hard if the predicates that are used in this example are not even defined in the source code, but are generated at compilation time of the library! That's because library(clpz) uses many mechanisms of the underlying Prolog system before a single query can even be posted, and all of these mechanisms must be implemented correctly for library(clpz) to work correctly.

A beauty of these developments: Once these mechanisms are implemented correctly in the underlying Prolog systems, all Prolog libraries benefit! Implementing them correctly is hard, no doubt, and errors may be introduced and fixed at certain points during development. But it is possible to implement all this correctly and efficiently: SICStus Prolog proves it. Scryer Prolog is a good example of a free Prolog system where many key mechanisms that are necessary for the implementation of constraint solvers work correctly. Trealla Prolog is already working on these areas. All future Prolog systems will have to provide these mechanisms with the same level of correctness if they want to become competitive in this specific area of Prolog implementations.

And on top of this, library(clpz) may even have its own mistakes and opportunities for improvement, we have seen several such examples in filed issues. The development of this library is analogous to that of Prolog systems: Every mistake we find and correct automatically benefits all applications that use the library. This is the advantage of a specific form of abstraction: With the interface of library(clpz), we abstract the meaning of these constraints from their internal implementation, and we can make them more efficient and more correct while the applications that use these constraints can remain completely unchanged.

The logical guarantees that library(clpz) provides in its monotonic execution mode enable a form of debugging that only logic programming offers. This form of debugging can be used to explain unexpected failure in application code, and it can be used to find implementation mistakes in the library itself.

haijinSk commented 10 months ago

@triska Thank you very much for your response; I appreciate it. I can learn from every post you write, thank you!

I think, my problem/perspective/fear here is a mental gap in my mind between two concepts; or I see/feel "this something" as two (connected but) separate things, as two different logical types/levels: testing a pure Prolog program (as a program design process) vs. testing a set of versions of a Prolog system with a Prolog program, automatically and reliably.

triska commented 10 months ago

Yes, testing a Prolog system for correctness is indeed a tough mental challenge: That's because, one the one hand, we rely on certain logical properties that we know must hold in a correctly implemented system, and we in a sense "trust" the system that we can formulate tests based on these properties. Yet, on the other hand, we must also be aware that these logical properties may not hold if the system is incorrectly implemented, and so "distrust" the system to enable the reasoning we count on to detect any issues! To continue working, we must jump over this mental gap, even repeatedly and in more than one direction: To find implementation mistakes, we formulate test cases based on the properties we know must hold, and then compare them with the results we actually get. If we see unexpected results, we use logical properties to obtain smaller program fragments that show the essence of the issue.

I recommend to start with the most elementary logical properties we know must hold in a correctly implemented Prolog system. For instance, we know that for two pure goals A and B, the conjunction (A,B) describes the same set of solutions as (B,A). Commutativity of conjunction of pure goals is a basic algebraic property we can rely on when testing a Prolog implementation. If we encounter a situation where exchanging two pure goals yields different solutions, then we have found a mistake in a Prolog implementation. Each of these goals may throw an instantiation error, and may also not terminate. That's OK: It simply means we cannot say anything further if such an error or nontermination arises. The fact that test cases may not terminate means that a Prolog system must implement reliable timeouts, so that we can decide how much computation time we want to invest for testing. In contrast to instantiation errors, type and domain errors are semantically equivalent to silent failure, and so it is for instance unacceptable for (A,B) to yield concrete solutions when (B,A) yields a type error.

library(clpz) provides an important guarantee: All predicates it exports always terminate. Anything else is an implementation mistake. This is a property that simplifies testing considerably.

haijinSk commented 10 months ago

Now, I can see some light! Thank you so much!!!

library(clpz) provides an important guarantee: All predicates it exports always terminate. Anything else is an implementation mistake. This is a property that simplifies testing considerably.

Your post is super important to me. I need more programming experience with Prolog and constraint programming, more thinking, reading and learning, but this is super important/useful right now for me. Thank you very much!!!

haijinSk commented 10 months ago

The essence of the issue is that a query we expect to succeed fails. To find such a smallest fragment, we can use library(debug) and its operator (*)/1 to systematically generalize the program and find a smallest program fragment that still fails even though it is even more general. All goal that are "generalized away" can be textually removed from the program, and the remainder is a test case we can add to the build process to prevent this mistake from reoccurring.

One of my naive perspectives/fantasies: I have to find a point, in a Prolog program, where "good working" ends. I know that my program is correct but as a whole fails, while running on a given version of a Scryer version. My program is here as a testing tool. I need to find what my current Prolog system do wrongly. So not only one point where the bad behavior starts, it might be many things at once.

And if my program is a testing tool, the purpose of my program has changed. So how to make a good testing tool in advance? I mean there will be new versions of my Prolog system and I can't predict bugs in the implementation. Many things might stop working.

But we are not interested in bugs as such (yes, the implementer is), we can abstract every low-level bug at advance. We know what "must hold" in a Prolog system; today, tomorrow, always. We can test every fundamental logical property a correct Prolog system must have. But to test something means to do something concrete, specific. We can define an (poetically speaking) "infinite" set by induction, but a Prolog implementation is not an abstract induction machine. Everything we do not test can be buggy. There are, probably, some hidden bugs in every non-trivial, useful program. A compiler is a program as well. Rust programming language is a program using programs... I'm a skeptic in this regard.

We, you have a collection of gems, so to speak. Prolog programs such as the Golfer problem solver. If a Prolog system pass successfully the collection, it means something. In my imagination there is something abstract and general in all the concrete programs. Bottom-up induction. There in no guarantee, but If the collection holds... I will trust the system.

Again, thank you...

triska commented 10 months ago

We can test every fundamental logical property a correct Prolog system must have.

This is indeed a key insight for testing a Prolog system, and you phrased it perfectly: "We can test every fundamental property a correct Prolog system must have."

Note especially the following: To test a Prolog system's implementation of the pure monotonic core of Prolog, we do not need to know what any predicate means! It suffices to know that programs in that set must satisfy certain logical properties, such as:

These logical properties are all we need in order to seriously test a very significant portion of any Prolog system. The formulation above therefore does not even need to mention a "specification", a "meaning" etc., all of which one would associate with a "test" in other programming languages.

Only logic programming lets us test implementations in this way: To test without even knowing what any predicate means! To test by arbitrarily removing portions of a program! To test by arbitrarily reordering goals! All of this is unthinkable in every other programming paradigm. It is a unique attraction of logic programming languages, and Prolog implementations in particular. There is thus hope that Prolog systems can become more correct than implementations of any other programming languages, because the logical properties we expect from implementations can be more easily formulated and systematically tested.

haijinSk commented 10 months ago

"They" say, there is "no silver bullet". But!!! this sounds to me like a holly grail of software testing... And more...

I want it, when I design my programs or to test a Prolog implementation by hand (or by my own collection of programs). I don't know how to integrate this to the Scryer automatic testing process, in a systematic and succinct way, but I want it there as well...

Again and again, I thank you very, very much for every your word!!! Thank you for your time, patience, effort, talent!!!

I apologize, that I don't see it so clearly and in all aspects... But, I can see something, I can see more and more light... I'm afraid, for Scryer it means nothing, my Prolog light seeing, but it means much to me personally.

Thank you so much...

UWN commented 10 months ago

Adding a fact has no influence on universal termination.

Has no influence on universal non-termination.

haijinSk commented 10 months ago

So... I can (but this can be my misunderstanding or/and a bad wording) do "syntactic subtraction" from any pure Prolog program code until I get false or nontermination (as the answer to my original "main" query), and the code before that last "subtraction" is something like a "logical shape" of my original program, a shape useful to test new versions of my Prolog system; at least in regard to my original Prolog program.

triska commented 10 months ago

In the pure monotonic core of Prolog, removing a goal makes the program or query more general, and removing a clause makes the program more specific. For example, if we have:

p(A, B, C) :-
        a(B, C),
        b(C),
        c(A, B).

then we can arbitrarily generalize the program it by inserting (*)/1 from library(debug) in front of goals:

p(A, B, C) :-
        * a(B, C),
        b(C),
        * c(A, B).

We can thus debug the program even if we have no idea at all what any of these predicates means: If the resulting fragment fails, then the original program also fails. To debug unintended failure, it therefore suffices to consider such fragments. No change in the strikeout parts can prevent the failure.

Adding false/0 at arbitrary locations lets us ignore all goals after false. For instance, we have:

p(A, B, C) :-
        a(B, C),
        false,
        b(C),
        c(A, B).

We know that p/3 defined in this way can never hold, since false/0 is always false. Yet, the following important property holds: If this fragment does not terminate universally, then the original program also does not terminate universally. No change in the strikeout parts can prevent the universal nontermination. To debug unintended loops, it therefore suffices to consider such failure slices.

Note how we can arbitrarily remove goals, and still obtain noteworthy relationships between the resulting fragments and the original program. Only logic programming allows such reasoning.

UWN commented 10 months ago

More examples of failure slices

haijinSk commented 10 months ago

Oh... So, this is it. I was so blind to these logical differences... to many aspects of that...

If:

No change in the strikeout parts can prevent the failure.

and:

If [a] fragment does not terminate universally, then the original program also does not terminate universally.

then it's a real beauty... and useful beauty... Now it seems almost simple, simply logical, and it was here all the time...

I know there is at least one video on this thing in your phenomenal project The power of Prolog, I'm sure, I saw it, at least once, at one time. I was so fascinated by DCGs recently... That was almost all my Prolog. Maybe, in my case, some things need the right time and context...

I thank you, again and again, for your patience and for every your word, thank you very much!!!

Also, @UWN thank you for the "failure slicing" link...

notoria commented 10 months ago

I'm using this to find issues in the compiler:

test.pl ```prolog :- use_module(library(lists)). :- use_module(library(between)). :- use_module(library(iso_ext)). :- use_module(library(format)). :- dynamic(sample/2). sample([], _). run(D0:N0) :- writeq(D0-N0), nl, Depth = 6, % Generate. between(0, Depth, D), D @>= D0, call_nth( ( tree(G, D, 0), term_variables(G, Gs), maplist(element([true,integer(_),_=_]), Gs), term_variables(G, Us), length(Us, Un), Un =< 12, bindery(E+A+B+G), term_variables(E+A+B+G, [E,A,B|Vs]), element([_|Vs], 0) ), N ), D:N @>= D0:N0, writeq(D:N), nl, ( assertz((sample([E|Es], A):-G,sample(Es, B))) ; retract((sample([E|Es], A):-G,sample(Es, B))), false ), % Test. findall(tt, mi(sample([0,1], 1)), Fs0), findall(tt, sample([0,1], 1), Fs), Fs \== Fs0, % Report. portray_clause([D:N,(sample([E|Es], A):-G,sample(Es, B))]), halt. element(Es, E) :- member(E, Es). mi(true). mi(false) :- false. mi((G0,G)) :- mi(G0), mi(G). mi((G0->G)) :- mi(G0), !, mi(G). mi((G0->G;_)) :- mi(G0), !, mi(G). mi((G;_)) :- \+ subsumes_term((_->_), G), mi(G). mi((_;G)) :- mi(G). mi(integer(T)) :- integer(T). mi(A=A). mi(sample(Es, A)) :- clause(sample(Es, A), G), mi(G). tree(_, S, S). tree(;(G0,G), S0, S) :- succ(S1, S0), tree(G0, S1, S2), tree(G, S2, S). tree(->(G0,G), S0, S) :- succ(S1, S0), tree(G0, S1, S2), tree(G, S2, S). tree(','(G0,G), S0, S) :- succ(S1, S0), tree(G0, S1, S2), tree(G, S2, S). bindery(T) :- term_variables(T, Vs), '@bindery'(Vs). '@bindery'([]). '@bindery'([A|As]) :- split(As, Bs, Cs), maplist(=(A), Bs), '@bindery'(Cs). split([], [], []). split([A|As], Bs, [A|Cs]) :- split(As, Bs, Cs). split([A|As], [A|Bs], Cs) :- split(As, Bs, Cs). % sh -c "ulimit -v 1000000; ulimit -c 0; ./scryer-prolog test.pl -g 'run(0:0)' -g halt" ```
tester.sh ```shell #!/bin/sh if [ -z "$1" ]; then echo "$0 [start]"; exit 1; fi if [ -z "$2" ]; then curr=0:0; else curr=$2; fi while [ "$curr" != "$prev" ]; do prev=$curr; curr=$( sh -c "ulimit -v 1000000; ulimit -c 0; ./target/release/scryer-prolog $1 -g 'run($curr)' -g halt" | \ tee /dev/stderr | \ grep -E '^[0-9]+:[0-9]+$' | \ tail -f -n 1 \ ); done ```

I run:

$ sh ./tester.sh test.pl 0:0 2>&1 | grep -E -v '(^[0-9]+:|dumped|memory)'

It's also possible to run scryer-prolog directly like:

$ sh -c "ulimit -v 1000000; ulimit -c 0; ./scryer-prolog test.pl -g 'run(0:0)' -g halt"

It's a bit slow to restart when scryer-prolog crashes (smaller counter could help), so there is room for improvement.

This version is small and doesn't test all the features of the compiler. It mainly tests (',')/2, (;)/2, (->)/2, integer/1 and bit of how variables are allocated with (=)/2.

In order to test the compiler, a sample program is generated (tree/3, ...) and executed by both Scryer and a small meta-interpreter (mi/1). The results of the execution are retrieved with findall/3 and compared. If the results are different then the sample is reported and the search halts.

This is one way to generate small program. Another way could be to randomly generate a big program, if there is an issue then reduce the program and report it.

Finding a smaller sample than this one for #2176 may prove to be difficult since more issues have been fixed.

triska commented 9 months ago

Another recent example of the system silently losing solutions: https://github.com/mthom/scryer-prolog/issues/2257#issuecomment-1872588330

Ideally, all such situations are found automatically, even if it may take a long time to find them.

triska commented 9 months ago

Another recent example of the system silently losing solutions, the worst kind of incorrectness: #2272.

triska commented 8 months ago

Another recent example, once again due to the way strings are currently implemented, making it easy to forget handling them in certain branches because their handling is spread over so many places: #2293.