nwf / dyna

Dyna2 compiler and REPL
GNU Affero General Public License v3.0
146 stars 20 forks source link

implement gensyms #43

Open jeisner opened 11 years ago

jeisner commented 11 years ago

As per longstanding design, the unary symbol * is a gensym.

Basic gensyms should be easy enough to implement for now (the design only gets tricky once we add stuff like persistence/saving, multiple dynabases especially with extension, and prefix aggregators #39).

The main thing you need is a little syntactic sugar. The rule

p(W,X) min= q(X,Y) + r(Y,*).

just needs to be read as if it were

p(W,X) min= q(X,Y) + r(Y, $g5(W,X,Y)).   

where the functor $g5 means that this is the 5th gensym * that has been encountered by this Dyna process (there's a global counter), and the arguments are all of the variables from the rule (if any). The $g5 functor should be declared to have a dispos that says it doesn't evaluate, so that it doesn't need to be quoted when it is printed or read.

One question is whether the user should be able to type literal gensym names like $g5. That is certainly bad form in a program since it's dangerously implementation-dependent: they can't really guess what the name will be. Worse, it could result in clashes: when loading a .dyna file that contains $g5 when $g5 was already defined.

On the other hand, it can be useful in queries (imagine tracing back some pointers implemented with gensyms). I think a good compromise is to allow it in queries, but to reject it in user rules -- here's a suggested message:

Error: You can't refer to the gensym $g5 directly in a rule.  
Use `*` to get a new gensym.  If you want to refer to the 
same gensym across rules, you must give it a reusable 
name by assigning it to an item.
nwf commented 11 years ago

Gensyms are harder to get right than we'd like, and it's really not a matter of only syntactic sugar. Gensyms capture the non-deterministic, non-gensym variables in a rule, which means that we need to have much stronger analysis of determinism information in plans, of which we have some, as well as some type information, of which we have none.

The "non-gensym" part is to avoid circularity in the definition, which is necessary for correctness. The "non-deterministic" part is necessary for both correctness (consider a rule with X = f(*) -- * must not depend on X) and for efficiency, so that we do not change gensyms whenever values change. You elided this when you expanded your syntactic sugar -- you knew to include only W, X, and Y in the list. If, on the other hand, the user had written something more like ANF,

p(W,X) min= Q is q(X,Y), R is r(Y,*), Q + R.

then because r/2 may return its second argument as part of its value, we must not include R in the dependence of *, and we do not want to include Q, since that will mean changing the gensym (and recomputing r(Y,*)) for every update to q/2.

I will continue to suggest that gensyms are entirely unpronounced in Dyna, both in the surface language and in queryes, and that if you want to name a gensym, you have to fold it out yourself or extract it as part of a query.

P.S. We should prove that our gensym semantics are invariant under planner choice, and should ensure that all plans produce the same gensym, which will almost undoubtedly require some form of canonicalization of the list of non-deterministic points. A separate concern is that when I wrote "variable" above I really meant "ground structure" -- variables over-estimate the nondeterminism of groundings of a rule; this may not be a problem (I do not have a counter-example handy) but it could be worth watching out for in the proof above.

jeisner commented 11 years ago

The "non-gensym" part is to avoid circularity in the definition, which is necessary for correctness. The "non-deterministic" part is necessary for both correctness (consider a rule with X = f() -- \ must not depend on X) and for efficiency, so that we do not change gensyms whenever values change. You elided this when you expanded your syntactic sugar -- you knew to include only W, X, and Y in the list.

Yes. I was suggesting that to get gensyms flying, that we could just harvest the surface variables from whatever the user wrote, before auto-evaluation desugaring kicks in. (This is indeed inefficient in the case where the user desugars manually.) This answers your first and third points, doesn't it? Your second point about X=f(*) is a good one: I hadn't appreciated that (or else forgot an earlier discussion). Would it be terrible to assume temporarily that the user just won't write that kind of thing, or to temporarily prevent the user from writing that kind of thing? It would be better to get it right, of course.

I will continue to suggest that gensyms are entirely unpronounced in Dyna, both in the surface language and in queryes, and that if you want to name a gensym, you have to fold it out yourself or extract it as part of a query.

Well, how about in printing the results of queries? I had imagined something like

> query foo(P,Q).
foo("a", $g5) = 1.
foo("b", $g6) = 2.
foo("c", $g5) = 1.

If all of these gensyms were simply printed as *, then we couldn't tell that two of them were equal. And we couldn't query them further as in query bar(P, $g6) -- I'm not quite hard-hearted enough to want to deny the users the ability to do that. FWIW, Lisp creates gensym names for the user's sake.

On the other hand, I don't think we have to print all of the arguments to a gensym term: that is revealing more than needed. Something like a memory address would be fine. Here is a proposal: internally, the gensyms are represented as ground terms such as groundings of $gensym5(W,X,Y). When the interactive printer is asked to print a particular gensym such as $gensym5(1,2,3), instead it prints a name like $g77. This name is not the gensym itself, but it evaluates to the gensym. (So we have to be careful when printing terms that we do not quote $g77 when it appears in an evaluation context, and that we force evaluation of it in a non-evaluation context.) The first time the gensym is printed, the printer magically adds a rule such as $g77 def= $gensym5(1,2,3) (where the "77" is a global counter maintained by the printer, so that the names assigned are sequential within a repl session). Thereafter, the printer uses the same name is used, thanks to its own hash table or just query of the form $gensym5(1,2,3) is Name.

Thinking ahead: Once we have defaults and stuff, we may sometimes have to print non-ground versions of gensyms as well, I think? But maybe $gensym5(W,2,3) would be printed as something like :gensym G so that we know it's a gensym but not which one it is? Of course, two references to $gensym(W,3,5) with the same W would have to print the same G.

We should prove that our gensym semantics are invariant under planner choice, and should ensure that all plans produce the same gensym,

This seems a bit stronger than needed. If two different plans resulted in a different but isomorphic set of gensym names, that'd be ok, wouldn't it?

which will almost undoubtedly require some form of canonicalization of the list of non-deterministic points.

Doesn't this happen when the rule is compiled and * is translated into $gensym5(W,X,Y)? Choosing that name happens first, and thereafter all the planning just runs on the gensym-free rules, no? So how can different plans do different things?

nwf commented 11 years ago

We should prove that our gensym semantics are invariant under planner choice, and should ensure that all plans produce the same gensym,

This seems a bit stronger than needed. If two different plans resulted in a different but isomorphic set of gensym names, that'd be ok, wouldn't it?

Unfortunately, no, because in the long run we want to re-plan at runtime. If we cannot guarantee equality of gensyms, then switching plans will require that we shoot down all the old ones first. This, of course, is correct only if we run one plan at a time -- attempting to mix plans for exploration will not work out well.

Doesn't this happen when the rule is compiled and * is translated into $gensym5(W,X,Y)? Choosing that name happens first, and thereafter all the planning just runs on the gensym-free rules, no? So how can different plans do different things?

That would be ideal, but, determinism is a property of the modes in which functions are invoked. I do not think we can do the rewriting until after query planning (and after type elaboration, too), and that leaves us open to non-canonical representations as we vary query order. In particular, consider a join like f(X) = g(Y) where f/1 and g/1 are bijections (i.e., all of out is f(in), in is f(out), out is g(in), and in is g(out) are supported and deterministic) which support enumeration (i.e., out is f(out) and out is g(out) are both supported, too, though non-deterministic). For the query plan of f-then-g, the nondet vars are {X,Result}, while for the query plan of g-then-f, the nondet vars are {Y,Result} (where Result is the result of evaluation; it would be possible, though more work, to deduce that only one of those two variables are actually needed (in both cases) -- the safe assumption is that all variables bound by a non-det operation are non-det). I don't think we can simply take the intersection -- consider a three-way join of bijections in the Prolog fragment, e.g. f(X,Y), g(Y,Z), h(Z,A); various plans would produce {X,Y}, {Y,Z}, or {Z,A} as the nondet vars set, and of course there is non-determinism in this rule, despite the empty intersection.

jeisner commented 11 years ago

But the semantics of the rule should be independent of mode! So if our definition of the semantics of gensyms depends on the mode, we should try to fix the definition so that it doesn't depend on the mode anymore. Or depends on some canonical mode rather than the actual mode in which the rule is executed.

nwf commented 11 years ago

Sure, thus "prove that our gensym semantics are invariant under planner choice". I am quite open to suggestions, but admit that I am at a loss.

jeisner commented 11 years ago

Except for your X=f(*) example, which we can return to, here's how I would have thought of it:

I always thought we defined the semantics as capturing all variables. Then it's merely an efficiency issue that we may not need to capture all of them because others are redundant.

For efficiency, we can choose any subset so that conditioned on that subset, the rest are fully determined. Then we just always capture that subset (regardless of plan). There may be multiple choices of subset, even multiple minimal choices on rare occasions.

We would prefer to pick a subset such that gensyms have a small memory footprint individually and (if we intern them) collectively. We'd also prefer for the gensyms not to change too much when the inputs change or as execution runs.

On this last point, consider your example f(X) == g(Y) (I think you meant == rather than =), or equivalently Result is f(X), Result is g(Y). Your point is that any non-empty subset of the three variables {X,Y,Result} would be enough to reconstruct the others. So we should just pick one, ideally. But the one we pick might affect efficiency. Fleshing out your example,

weight(*) = 1 for f(X)==g(Y).

will change the set of weight names in the chart when the bijections f and g are updated. It's possible that the set of Xs for which f(X)==g(Y) for some Y will remain most stable, or that the set of Results such that Result is f(X), Result is f(Y) for some X,Y will remain most stable.


Could the definition of gensym semantics just focus on is and ignore other things that might be deterministic?

out is any is semidet. (Am I writing that correctly?) What else has (semi)det modes and where does it get them? Built-ins?

nwf commented 11 years ago

Well, they surely cannot capture all variables, since that would include capturing themselves (trivially in ANF, or if the user writes X=* so as to reuse *, in which case, this overt variable must be excluded, too; X = f(*) is essentially the same problem but poses a bigger challenge to a type-based way of dealing with the problem). In the case of Y is f(*), Y must be omitted as well; while the mode out is f(out) could enumerate such things, it seems hard to imagine that the f/1 rules could construct the right gensyms a priori. (Here's possibly the first case we've seen of something that supports out and not in; of course we can transform the in case to out and unification.)


I don't think you mean out is any, since F is G, with both F and G free, is not permissible, never mind semidet. out is in is (by fiat) semidet for the terms which support evaluation (so it's not really fair to say that out is in is semidet in general, since we do not, I think, plan to have code that corresponds to such a thing?)

(P.S. The appendix, in discussion of diamond normal form, notes that we must exclude dynabase vars from diamonds, but offers a syntactic test. I think, however, that that is in error, as it does not account for circularity as discussed here. Are you publishing an errata or up-to-date version of the tech report?)

jeisner commented 11 years ago

Let's start with fundamentals. I think the basic intuition is that we "simply" want to create a different value of * on each hyperedge. The hyperedges are defined by the groundings of the rule as if * were a constant, and only then do we specialize * accordingly.

If that principle is correct, then the variable capture rules should flow from that. However, is it correct? For example, suppose that we write a rule like f(A,[A,A]) = 1. so that the second argument has a functional dependency on the first argument. Then something like count += f(*,_) should be 1, or infinity? What if the definition of f depends on other items so that the functional dependency is only sometimes present?

nwf commented 11 years ago

I like treating X = * as a deterministic unification; it nicely avoids capturing X in the expansion of any * in the rule (including its own).

I think count += f(*,_) (good example) would be 1: the completions are all specializations of count += 1 for 1 is f(X,[X,X])., with X grounded to a gensym, so there are no non-gensym variables for * to capture. Does that seem reasonable?

I think that any fundep which is "only sometimes present" must be ignored by our estimate of determinism?

jeisner commented 11 years ago

[Here is something I wrote yesterday and just finished up. However, it's superseded by a simpler treatment that nwf and I discussed, and which I'll put in the next comment.]

I think that any fundep which is "only sometimes present" must be ignored by our estimate of determinism?

By describing it as an estimate of determinism, do you mean that later implementations of Dyna could use tighter estimates? I don't think so for cases where that would change the semantics.

I'd rather not regard what we are doing as "estimating determinism" but more as tagging hyperedges. If changing some other item in the program gives us more hyperedges for this rule, for example by breaking a fundep, then the number of gensyms should increase.

I was going to say that it doesn't matter whether we keep any of the same gensyms as before. But maybe it does matter for the question of stable randomness across snapshots, and testing equality across snapshots. So the internal naming of a gensym (e.g., as $gensym5(8,9)) may matter.

I like treating X = * as a deterministic unification

Again, the trick is to treat each literal * as a new constant. So it doesn't match any other constant, but it does match variables (at least variables whose type includes gensyms): that is why f(*,_) worked above.

A crucial fact is that because the constant is new, changing it to another constant can't affect whether the rest of the grounding succeeds. None of the rules or built-ins give different results depending on which gensym they get: anything that is true of one gensym is true of another.

(Again, I'm a bit worried whether that is really the case for random number generators. It is okay for a RNG that just generates and memoizes a value, but maybe not for an RNG that hashes the actual gensym representation using a randomly chosen hash function. But maybe for the sake of defining the semantics we can pretend it is true, or rely on the fact that different gensyms give the same distribution of random numbers, i.e., over samples of the hash function.)

Let's consider how to interpret this rule

foo(X) += 1 if X = * && is_bool(Y).

whose result should be that foo(X) is 1 for two different gensym values of X.

One notion I had was that we first interpret the rule by instantiating * as a specific arbitrary new gensym, which I'll denote as *789*:

foo(X) += 1 if X = *789* && is_bool(Y).

There are many groundings of this rule, including

foo("bar") += 1 if "bar" = *789* && is_bool("bananas").
foo("bar") += 1 if "bar" = *789* && is_bool(true).
foo("bar") += 1 if "bar" = *789* && is_bool(false).
foo(*456*) += 1 if *456* = *789* && is_bool("bananas").
foo(*456*) += 1 if *456* = *789* && is_bool(true).
foo(*456*) += 1 if *456* = *789* && is_bool(false).
foo(*789*) += 1 if *789* = *789* && is_bool("bananas").
foo(*789*) += 1 if *789* = *789* && is_bool(true).
foo(*789*) += 1 if *789* = *789* && is_bool(false).

but only the last two have successful bodies. We now take each of those successful bodies and replace *789* by a different fresh gensym in each case:

foo(*7891*) += 1 if *7891* = *7891* && is_bool(true).
foo(*7892*) += 1 if *7892* = *7892* && is_bool(false).

Unfortunately, this doesn't quite work in cases like this:

foo(A) = * for foo(A)==bar(A) || start.   % or equivalently: foo(A) = G for G=*, (G is bar(A) || start).
bar(A) = foo(A).

which should give a self-supporting cycle (that can be kicked off by setting start=true temporarily). The problem is that when once we have bar(0) being *7891*, then reevaluating the first rule (with start=false) will fail when we arbitrarily choose * to denote *789*, because *789**7891*.

So a better story is that we first interpret the rule by taking * to denote a new variable _G, restricted to have a gensym type associated with this particular *. Then we look at all groundings in which _G does not necessarily have to be grounded -- it is specialized as little as possible for the grounding to succeed (I'm not sure this is well-defined!). In each grounding in which _G remains a free variable rather than having unified with something (like the *7891* in the previous example), we replace it with a different fresh gensym (again *7891*, *7892*, etc.).

I think that any fundep which is "only sometimes present" must be ignored by our estimate of determinism?

If we internally represent a gensym as something like a grounding of $gensym5(X,Y), then I agree that we ought to capture Y if it might not be unique given the other variables. However, that is not the only way to represent a gensym. If we have

count(X) += f(*,Y) / g(X).
f(A,[A,A]) += 1.
f(A,[]) += 2 for switch.
f(A,B) += 3 for faucet.   % not mutually exclusive with other rules

then one option is to interpret * as $gensym5(X,Y) as above, so that we always store Y. But we could also interpret it variously as $gensym5(X), $gensym6(X), and $gensym7(X,Y) according to the value of Y, so that if switch and faucet are usually false then we don't waste space on storing Y.

(I am only mentioning this to make a point. In practice I think we would indeed standardize on $gensym5(X,Y), but for efficiency, implement that with a union class so as to avoid storing Y in the frequent cases.)

jeisner commented 11 years ago

Here's another, simpler story, which @nwf actually buys -- we discussed it from 3:30-5:00 yesterday -- except that I'll try to improve it a little more here.

As proposed earlier, the semantics is that we internally replace a given * token in the program with something like $gensym5(A,B,C,...) whose arguments capture all variables in the rule. The functor $gensym5 is specific to the particular * token, and I think we'll want to preserve it under snapshotting and extension. (Conceivably even under edits to the source .dyna file that preserve or approximately preserve this rule? That will reduce the amount of recomputation. But maybe not: since to make it persistent, we might have to derive its internal name from an MD5 hash of the source file.)

(Note that we'll do this eventually for dynabase handles as well: a gensym is really just an empty dynabase.)

( Exception: We don't capture variables whose names start with underscore. The user is explicitly excluding these. This is very convenient in code that extracts feature weights, for exampe. And it doesn't add any power since the user is already free to manufacture gensyms that depend on any subset of the variables they like. E.g., if they want to depend on C only, then instead of writing * they can write mygensym(C) where they defined mygensym(_)=*.)

( Note: We need to decide how * interacts with infix aggregators. Should it capture B in edge(A,*) max= g(A) / (+= g(A,B))? -- probably no since that would guarantee that the += would have only one aggregand. Should it capture A in f(A,B) = cartesian_product(myset(A), (set= [* | B]))? -- I'm guessing no again, but less clear. In other words, I think we replace the infix aggregation with a temp expression before we choose the arguments for the *.)

Gensym terms should probably be unanalyzable. Sure, we internally represent a gensym as a grounding of $gensym5(A,B,C,...), but it doesn't print like that, and the user cannot destructure it with unification nor with accessors for the internals of terms. The debugger might however let one see where a gensym came from (i.e., which grounding of which rule)

Now, it is possible that some variables in the rule are bound to structures that contain the genysm itself, via a condition like B=* or B=f(*) ... in which case $gensym5(A,B,C) is actually a cyclic term that fails the occurs check. But so what? It's not a real term in the term universe that the user can see; it's just a possibly cyclic a structure that we are building for internal use. We can test identity of cyclic structures as needed.

(By the way, what are we going to do about the occurs check generally in Dyna?? Maybe cyclic terms can turn into $error values? We have the usual problem that if the name of an item is an error value, it is hard to report the error.)

In practice, the implementation does not have to store the full cyclic structure. It can simplify the representation of a gensym provided this doesn't collapse gensyms that were formerly distinct. Therefore, we analyze the rule to find a sufficient set of variables, meaning that once those variables are known (along with `itself, because it will be built from those variables, whatever they are), the other variables are fully determined in any successful grounding.* (In other words, the mode where the sufficient variables areinand the others areoutis semidet, whether or not we can actually generate a plan for that mode!) Then the arguments to$gensym5` only need to include this sufficient set -- that is enough to distinguish the distinct gensyms.

When @nwf and I met yesterday, we were ready to accept that the particular sufficient set might be implementation dependent in a cyclic case like

f(*) = A=f(B), B=g(A), A+B.   % it's enough to store either A or B

or even in a less weird case like

f(A,B) = * for A+delta==B.   % it's enough to store either A or B

But it does matter whether we store A or B or both. Why? Suppose we have two snapshots d and e of the dynabase with the rule above, with d.delta==0 and e.delta==1. Thus, d defines items like f(5,5) and f(6,6), while e defines items like f(5,6). If we store both A and B, then e.f(5,6), d.f(5,5), and d.f(6,6) are all different gensyms. But If we store A then e.f(5,6)==d.f(5,5) (they're the same gensym), whereas if we store B then e.f(5,6)==d.f(6,6) instead.

It would be better not to leave this implementation dependent since the behavior should be predictable. And I think there's a simple and natural way to standardize. In the case above, since delta is not const, I think the gensym should actually store both A and B -- the functional dependence between them is not "permanent" but depends also on the extension we're in. As a result, e.f(5,6), d.f(5,5), and d.f(6,6) will all be different gensyms. That seems intuitive to me, since * in this example was probably intended to identify each * with a particular ordered pair (A,B).

The crucial point is that if delta were const, then storing either A or B would be tantamount to storing both, and (as long as the gensyms are unanalyzed) the user can't tell which is being stored. The implementation may have to know, though, because a binary file might get handed off from an implementation that stores A to one that stores B.

So how do we find the sufficient set S?

p.s. One example from our discussion on the blackboard:

 A is f(B,C), C is f(A), B is g(D), X=f(Y), Y=*, Y is g(Z)