rust-lang / datafrog

A lightweight Datalog engine in Rust
Apache License 2.0

Proposal: extend API to support more operations on relations [breaking change] #22

Closed nikomatsakis closed 5 years ago

nikomatsakis commented 5 years ago

While doing some work using datafrog, I got annoyed by the need to create intermediate variables and things instead of being able to work directly with relations. I also noticed that Relation::from, which currently takes an iterator and collects to an intermediate vector, introduces some inefficiency. (As an example, from_leapjoin seems to be making an extra vector on each iteration.)

This branch aims to correct both things.

One side effect of this change is that you can often pull computation out from the "while loop" of the iteration and instead just construct Relation tuples directly. It seems to me that this must be more efficient, but @frankmcsherry I'd be curious if you think I am wrong.

cc @lqd, @frankmcsherry

nikomatsakis commented 5 years ago

Oh, this PR also ports to Rust 2018 and adds some tests. Didn't find any bugs though. =)

nikomatsakis commented 5 years ago

One side effect of this change is that you can often pull computation out from the "while loop" of the iteration and instead just construct Relation tuples directly. It seems to me that this must be more efficient, but @frankmcsherry I'd be curious if you think I am wrong.

One thing I've been wondering is whether it makes sense to "stratify" where you can. e.g., if you can transform something like this:

let mut iteration = Iteration::new();
let var1 = iteration.variable("var1");
let var2 = iteration.variable("var2");
...
while iteration.changed() {
    var1.from_join(...); // does not use `var2`
    var2.from_join(&var1, ...);
}

into a series of loops:

let var1: Relation<_> = {
    let mut iteration = Iteration::new();
    let var1 = iteration.variable("var1");
    ...
    while iteration.changed() {
        var1.from_join(...); // does not use `var2`
    }
    var1.complete()
};

let var2 = Relation::from_join(&var1, ...);

would that be more efficient? The advantage is that you are processing fewer things at a time; the disadvantage is that we are invoking complete on var1 and thus merging a bunch of relations that never needed to be merged before. I guess the only way to know is to measure it.

(Obviously, sometimes you must do this, e.g. to use antijoin on a computed relation.)
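For intuition, the `while iteration.changed()` pattern discussed here boils down to an ordinary fixed-point loop. Below is a minimal sketch in plain Rust, with std types standing in for datafrog's Variable/Relation and transitive closure as the example rule (the function name and types are invented for illustration, not datafrog's API):

```rust
use std::collections::BTreeSet;

// A hand-rolled fixed-point loop mirroring `while iteration.changed()`:
// derive reach(x, y) from edge(x, y) until no new tuples appear.
fn transitive_closure(edges: &[(u32, u32)]) -> BTreeSet<(u32, u32)> {
    let mut reach: BTreeSet<(u32, u32)> = edges.iter().copied().collect();
    let mut changed = true;
    while changed {
        changed = false;
        // Snapshot so we can insert into `reach` while scanning it.
        let snapshot: Vec<(u32, u32)> = reach.iter().copied().collect();
        for &(x, y) in &snapshot {
            for &(a, b) in edges {
                // reach(x, b) :- reach(x, y), edge(y, b).
                if a == y && reach.insert((x, b)) {
                    changed = true;
                }
            }
        }
    }
    reach
}

fn main() {
    let closure = transitive_closure(&[(1, 2), (2, 3)]);
    assert!(closure.contains(&(1, 3)));
}
```

Stratification, in these terms, is running one such loop to completion and feeding its frozen result into the next loop, rather than iterating all rules together.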

frankmcsherry commented 5 years ago

I think stratification can and does make sense, but the trade-offs you point out seem legit.

Another reason to stratify execution is to reclaim resources between the strata. Imagine you have a bunch of these, and perhaps var1 gets used in the definition of var2, but not afterwards: you should be able to compute up through var2 and then recover any resources associated with var1 (and the intermediate arrays associated with var2).

In principle there is also strength reduction, where things could be faster with a Relation than with a Variable. Treefrog leapjoin is an example of that, where the Variable implementations just don't exist at the moment because they are a pain to write well.

the disadvantage is that we are invoking complete on var1 and thus merging a bunch of relations that never needed to be merged before.

You don't have to call complete, if you don't want. You could just keep it a Variable if that is better. So, perhaps "stratification is good" can be handled independent from "completing a now-static Variable may be good".

nikomatsakis commented 5 years ago

@frankmcsherry yeah, that all makes sense.

My hunch is that, in a lot of these cases, the differences won't be so extreme anyway -- it probably makes sense to 'start out' stratified and then selectively merge.

Treefrog leapjoin is an example of that...

On an unrelated note regarding leapjoin: I was debating about extending the interface so that you can supply a tuple of "joining things", to eliminate the virtual dispatch and looping -- but I wanted to look at the actual assembly first; I sort of suspect LLVM might optimize it to the same thing.

e.g., you could imagine permitting

from_leapjoin(
    &var,
    (
        foo.extend_with(...),
        bar.extend_anti(...),
    ),
    logic,
)

where the tuple works in place of &mut [&mut dyn Foo].

nikomatsakis commented 5 years ago

I guess it'd be just more ergonomic anyway. Maybe I should do it. =)

nikomatsakis commented 5 years ago

As an example, from_leapjoin seems to be making an extra vector on each iteration.

FWIW, though, I saw no difference in perf when porting polonius over to use this branch.

lqd commented 5 years ago

In order to reuse a Variable one hadn't completed (in the sense of "you don't have to call complete if you don't want"), would there need to be a way to move Variables from one Iteration to the next (to continue computation in another stratum while avoiding merging the Relations at the end of the current stratum)?

It could be interesting to measure, but I'm guessing Lark has more uses of this pattern than Polonius (which has none in the current repo: there's one in the CFG compression branch, one in my "missing subsets" branch, and possibly more in the future when using both the location-insensitive and opt variants in a single analysis)?

frankmcsherry commented 5 years ago

would there need to be a way to move Variables from one Iteration to the next though

I don't think so. All that associating a Variable with an iteration does is link the "are we done yet?" logic for the iteration with the variable. If you happen to have a variable around because it came from a previous loop, you should just be able to use it.

That it works is sort of a happy coincidence of not having a "runtime" that manages all of this, and that lie may eventually come around and bite folks (that "this works" is a weird combination of idioms and luck).

frankmcsherry commented 5 years ago

I was debating about extending the interface so that you can supply a tuple of "joining things", to eliminate the virtual dispatch and looping

Probably! You might need to tweak the traits so that there is something like a LeapMany which gets implemented for (L, LM) where L: Leaper, LM: LeapMany. But that should allow static resolution of all of the calls. It is all in a pretty tight loop, so it could be that the virtual dispatch makes a difference. What we do in dataflow land is batch the calls instead, so that you extend a few prefixes at a time in each virtual call. No evidence that this is obviously better, though.
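The (L, LM) trick Frank describes can be sketched in miniature with a trait implemented for tuples of trait-implementing elements. Everything here (`Filterish`, `Even`, `GreaterThan`, `filter_join`) is invented for the sketch and far simpler than datafrog's real Leaper machinery, but it shows why every call resolves statically:

```rust
// A leaper-like trait: each "leaper" gets a say on each tuple.
trait Filterish<T> {
    fn keep(&self, tuple: &T) -> bool;
}

struct Even;
impl Filterish<u32> for Even {
    fn keep(&self, t: &u32) -> bool { t % 2 == 0 }
}

struct GreaterThan(u32);
impl Filterish<u32> for GreaterThan {
    fn keep(&self, t: &u32) -> bool { *t > self.0 }
}

// The recursive step: a pair keeps a tuple only if both halves do.
// (The real version would be `(L, LM) where L: Leaper, LM: LeapMany`.)
impl<T, A: Filterish<T>, B: Filterish<T>> Filterish<T> for (A, B) {
    fn keep(&self, tuple: &T) -> bool {
        self.0.keep(tuple) && self.1.keep(tuple)
    }
}

// Generic over the concrete tuple type, so `keep` is statically dispatched,
// unlike iterating over `&mut [&mut dyn Filterish<T>]`.
fn filter_join<T: Clone, L: Filterish<T>>(input: &[T], leapers: L) -> Vec<T> {
    input.iter().filter(|t| leapers.keep(t)).cloned().collect()
}

fn main() {
    let out = filter_join(&[1u32, 2, 3, 4, 5, 6], (Even, GreaterThan(2)));
    assert_eq!(out, vec![4, 6]);
}
```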

lqd commented 5 years ago

Yeah I mean you will be able to use it as a join/map input (probably not all antijoins) to one of the current iteration's Variables — and this could definitely be useful.

I was thinking about not being able to produce new tuples for the old Variable without calling changed() on it yourself in the new stratum, since the new Iteration won't do it for you. However, this particular case seems unlikely to ever be useful in practice, especially when translating rules from datalog, and could be what you meant by it possibly biting folks.

nikomatsakis commented 5 years ago

I implemented and tested the tuple-based leapjoin API -- see the last commit. It proved to be about 1% more efficient (21 billion instructions vs. 21.3 billion before). But it's also just a touch nicer to use:

live_to_dying_regions_r2pq.from_leapjoin(
    &subset_r1p,
    (
        cfg_edge_rel.extend_with(|&((_, p), _)| p),
        region_live_at_rel.extend_with(|&((r1, _), _)| r1),
        region_live_at_rel.extend_anti(|&((_, _), r2)| r2),
    ),
    |&((r1, p), r2), &q| ((r2, p, q), r1),
);

instead of

live_to_dying_regions_r2pq.from_leapjoin(
    &subset_r1p,
    &mut [
        &mut cfg_edge_rel.extend_with(|&((_, p), _)| p),
        &mut region_live_at_rel.extend_with(|&((r1, _), _)| r1),
        &mut region_live_at_rel.extend_anti(|&((_, _), r2)| r2),
    ],
    |&((r1, p), r2), &q| ((r2, p, q), r1),
);

nikomatsakis commented 5 years ago

It could be interesting to measure but I'm guessing Lark has more uses of this pattern than Polonius

Perhaps now, but I think this will change as we try to move more things over into polonius. e.g., I'd like to move initialization checking.

nikomatsakis commented 5 years ago

Regarding moving variables between iterations, I considered not calling complete. Indeed, I suppose it would work just fine. It just makes it a bit less obvious to know when you have "fixed" results that are not changing per-iteration and when you don't.

nikomatsakis commented 5 years ago

Another benefit of the tuple-based API:

It enables filter_with and filter_anti leapers to be sensibly used on their own. I made them only work with an "auxiliary value" of (), which is implicitly proposed. You can then use them to implement intersection and difference:

e.g.,

let intersection1 = Relation::from_leapjoin(
    &input1,
    input2.filter_with(|&tuple| tuple),
    |&tuple, &()| tuple,
);

and

let difference1 = Relation::from_leapjoin(
    &input1,
    input2.filter_anti(|&tuple| tuple),
    |&tuple, &()| tuple,
);

(Of course you might also selectively use parts of the tuple input.)
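For intuition, here is what those two leapjoins compute, restated over plain std collections. This is only a sketch of the set semantics (function names invented, std's BTreeSet standing in for datafrog's sorted Relation), not datafrog's implementation:

```rust
use std::collections::BTreeSet;

// intersection: tuples of input1 that also appear in input2
// (the filter_with case, where input2 "confirms" each tuple).
fn intersection(
    input1: &BTreeSet<(u32, u32)>,
    input2: &BTreeSet<(u32, u32)>,
) -> Vec<(u32, u32)> {
    input1.iter().filter(|t| input2.contains(t)).copied().collect()
}

// difference: tuples of input1 absent from input2
// (the filter_anti case, where input2 "vetoes" each tuple).
fn difference(
    input1: &BTreeSet<(u32, u32)>,
    input2: &BTreeSet<(u32, u32)>,
) -> Vec<(u32, u32)> {
    input1.iter().filter(|t| !input2.contains(t)).copied().collect()
}

fn main() {
    let input1: BTreeSet<_> = [(1, 2), (3, 4), (5, 6)].into_iter().collect();
    let input2: BTreeSet<_> = [(3, 4)].into_iter().collect();
    assert_eq!(intersection(&input1, &input2), vec![(3, 4)]);
    assert_eq!(difference(&input1, &input2), vec![(1, 2), (5, 6)]);
}
```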

I don't see yet how you could use extend_anti on its own, so that is not enabled. It feels like that needs something to be "proposing" values to make sense (I suppose you could use it to encode a kind of `exists { !foo(X, T) }` test, which wouldn't really be legal datalog per se).

nikomatsakis commented 5 years ago

OK, well, having used this API a bit I can't really imagine going back, so I'm inclined to merge it. =) The main question I guess is whether it's "ok" to have as many "convenience methods" as we do -- I'm inclined to say yes, and in fact I might add more (e.g., to express things like set-difference and so forth), but that can wait. I don't see a reason to push for real minimalism here, and in general having precise and simple ways for users to say things (e.g., set difference) might enable more efficient implementations in the future if we ever cared.

lqd commented 5 years ago

These additions are looking great.

I wonder if having some operators whose logic callback returns an Option<Tuple>, in addition to the ones returning a Tuple, would allow filtering (for example, the Polonius symmetries) in an easy way, or if it would be less efficient than, say, #20.

nikomatsakis commented 5 years ago

Ah yes I forgot about #20 -- we should land that before this, probably, and then I can rebase over it. I'm not sure I quite understand your proposal regarding Option<Tuple> though, @lqd -- which method would return the Option<Tuple>? Or would we add that somewhere?

It seems like integrating the PrefixFilter would be simple enough; it can implement the Leapers trait in roughly the same way as filter_with and so forth. The ValueFilter should not implement Leapers, since it doesn't really make sense in isolation (it could be implemented in roughly the same way as filter_with, i.e., substituting a dummy () value, but it'd just be a less efficient version of PrefixFilter so...).

lqd commented 5 years ago

As mentioned on Zulip (so, for folks following at home) it was another option à la

fn from_filterable_join<K: Ord, V1: Ord, V2: Ord>(
    &self,
    input1: &Variable<(K, V1)>,
    input2: &Variable<(K, V2)>,
    logic: impl FnMut(&K, &V1, &V2) -> Option<Tuple>,
) {
}

but is less interesting, in every way, than PrefixFilter and ValueFilter.
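Still, the shape of the idea is easy to show in miniature: a join whose logic closure returns Option, with None dropping the tuple, so a filter like Polonius's r1 != r2 symmetry check folds into the join itself. `join_filter_map` below is a made-up nested-loop stand-in over plain slices, not datafrog's (sorted, merge-based) implementation:

```rust
// Join input1 and input2 on their keys; `logic` returns Some(t) to emit
// a result tuple and None to drop the match. Quadratic scan for clarity.
fn join_filter_map<K: Ord + Copy, V1: Copy, V2: Copy, T>(
    input1: &[(K, V1)],
    input2: &[(K, V2)],
    mut logic: impl FnMut(&K, &V1, &V2) -> Option<T>,
) -> Vec<T> {
    let mut out = Vec::new();
    for &(k1, v1) in input1 {
        for &(k2, v2) in input2 {
            if k1 == k2 {
                if let Some(t) = logic(&k1, &v1, &v2) {
                    out.push(t);
                }
            }
        }
    }
    out
}

fn main() {
    // subset(r1, r2) joined with subset(r1, r3), dropping the r2 == r3 case.
    let out = join_filter_map(
        &[(1u32, 10u32), (1, 20)],
        &[(1u32, 10u32), (1, 30)],
        |_r1, &r2, &r3| if r2 != r3 { Some((r2, r3)) } else { None },
    );
    assert_eq!(out, vec![(10, 30), (20, 10), (20, 30)]);
}
```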

nikomatsakis commented 5 years ago

I rebased over #20, and Leapers is now implemented for PrefixFilter.