nikomatsakis closed this pull request 5 years ago.
Oh, this PR also ports to Rust 2018 and adds some tests. Didn't find any bugs though. =)
One side effect of this change is that you can often pull computation out from the "while loop" of the iteration and instead just construct Relation tuples directly. It seems to me that this must be more efficient, but @frankmcsherry I'd be curious if you think I am wrong.
One thing I've been wondering is whether it makes sense to "stratify" where you can. e.g., if you can transform something like this:
```rust
let mut iteration = Iteration::new();
let var1 = iteration.variable("var1");
let var2 = iteration.variable("var2");
...
while iteration.changed() {
    var1.from_join(...); // does not use `var2`
    var2.from_join(&var1, ...);
}
```
into a series of loops:
```rust
let var1: Relation<_> = {
    let mut iteration = Iteration::new();
    let var1 = iteration.variable("var1");
    ...
    while iteration.changed() {
        var1.from_join(...); // does not use `var2`
    }
    var1.complete()
};
let var2 = Relation::from_join(&var1, ...);
```
would that be more efficient? The advantage is that you are processing fewer things at a time; the disadvantage is that we are invoking `complete` on `var1` and thus merging a bunch of relations that never needed to be merged before. I guess the only way to know is to measure it.

(Obviously, sometimes you must do this, e.g. to use `antijoin` on a computed relation.)
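To make the trade-off concrete, here is a minimal sketch of the stratified shape in plain std-collection Rust rather than datafrog, with a hypothetical reachability rule standing in for `var1` and a derived `var2`; the names and rules are illustrative only:

```rust
use std::collections::BTreeSet;

// Stratum 1: iterate reachability to fixpoint on its own.
fn transitive_closure(edges: &BTreeSet<(u32, u32)>) -> BTreeSet<(u32, u32)> {
    let mut var1 = edges.clone();
    loop {
        let mut new = Vec::new();
        for &(a, b) in &var1 {
            for &(c, d) in edges {
                if b == c && !var1.contains(&(a, d)) {
                    new.push((a, d));
                }
            }
        }
        if new.is_empty() {
            return var1; // fixpoint: var1 is now effectively a static relation
        }
        var1.extend(new);
    }
}

fn main() {
    // Hypothetical edge tuples standing in for a datafrog input relation.
    let edges: BTreeSet<(u32, u32)> = [(0, 1), (1, 2), (2, 3)].into_iter().collect();
    let var1 = transitive_closure(&edges);

    // Stratum 2: `var2` depends on `var1` but feeds nothing back into it,
    // so it is a single pass rather than another round of the fixpoint loop.
    let var2: BTreeSet<(u32, u32)> = var1.iter().map(|&(a, b)| (b, a)).collect();

    assert!(var1.contains(&(0, 3)));
    assert!(var2.contains(&(3, 0)));
}
```

The point of the sketch is only the shape: once `var1` has stopped changing, nothing about `var2` needs to run inside the `while iteration.changed()` loop.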
I think stratification can and does make sense, but the trade-offs you point out seem legit.
Another reason to stratify execution is to reclaim resources between the strata. Imagine you have a bunch of these, and perhaps `var1` gets used in the definition of `var2`, but not afterwards: you should be able to compute up through `var2` and then recover any resources associated with `var1` (and the intermediate arrays associated with `var2`).
In principle there is also strength reduction, where things could be faster with a `Relation` than with a `Variable`. Treefrog leapjoin is an example of that, where the `Variable` implementations just don't exist at the moment because they are a pain to write well.
> the disadvantage is that we are invoking `complete` on `var1` and thus merging a bunch of relations that never needed to be merged before.
You don't have to call `complete` if you don't want. You could just keep it a `Variable` if that is better. So, perhaps "stratification is good" can be handled independently of "completing a now-static `Variable` may be good".
@frankmcsherry yeah, that all makes sense.
My hunch is that, in a lot of these cases, the differences won't be so extreme anyway -- it probably makes sense to 'start out' stratified and then selectively merge.
> Treefrog leapjoin is an example of that...
On an unrelated note regarding leapjoin: I was debating about extending the interface so that you can supply a tuple of "joining things", to eliminate the virtual dispatch and looping -- but I wanted to look at the actual assembly first; I sort of suspect LLVM might optimize it to the same thing.
e.g., you could imagine permitting
```rust
from_leapjoin(
    &var,
    (
        foo.extend_with(...),
        bar.extend_anti(...),
    ),
    logic,
)
```
where the `()` works instead of `&mut [&mut dyn Foo]`.
I guess it'd be just more ergonomic anyway. Maybe I should do it. =)
As an example, `from_leapjoin` seems to be making an extra vector on each iteration.
FWIW, though, I saw no difference in perf when porting polonius over to use this branch.
In order to reuse a `Variable` one hadn't completed (in the sense of "you don't have to call `complete` if you don't want"), would there need to be a way to move `Variable`s from one `Iteration` to the next, though (to continue computation in another stratum while avoiding merging the `Relation`s at the end of the current stratum)?
It could be interesting to measure, but I'm guessing Lark has more uses of this pattern than Polonius? (Polonius has none in the current repo: there's one in the CFG compression branch, one in my "missing subsets" branch, and possibly more in the future when using both the location-insensitive and opt variants in a single analysis.)
> would there need to be a way to move Variables from one Iteration to the next though
I don't think so. All that associating a `Variable` with an iteration does is link the "are we done yet?" logic for the iteration with the variable. If you happen to have a variable around because it came from a previous loop, you should just be able to use it.
That it works is sort of a happy coincidence of not having a "runtime" that manages all of this, and that lie may eventually come around and bite folks (that "this works" is a weird combination of idioms and luck).
> I was debating about extending the interface so that you can supply a tuple of "joining things", to eliminate the virtual dispatch and looping
Probably! You might need to tweak the traits so that there is something like a `LeapMany` which gets implemented for `(L, LM)` where `L: Leaper, LM: LeapMany`. But that should allow static resolution of all of the calls. It is all in a pretty tight loop, so it could be that the virtual dispatch makes a difference. What we do in dataflow land is batch the calls instead, so that you extend a few prefixes at a time in each virtual call. No evidence that this is obviously better, though.
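A minimal sketch of the `(L, LM)` idea: a trait implemented both for a single leaper and for pairs, so a nested tuple resolves every call statically. The trait here (`LeapMany` with one made-up `count_all` method on a toy `Fixed` leaper) is illustrative, not datafrog's real `Leaper` interface:

```rust
// A one-method stand-in for the leaper trait.
trait LeapMany<T> {
    fn count_all(&self, prefix: &T) -> usize;
}

// A toy "leaper" that reports a fixed count for any prefix.
struct Fixed(usize);

impl<T> LeapMany<T> for Fixed {
    fn count_all(&self, _prefix: &T) -> usize {
        self.0
    }
}

// The recursive case: a pair of (head leaper, tail of leapers).
impl<T, L: LeapMany<T>, LM: LeapMany<T>> LeapMany<T> for (L, LM) {
    fn count_all(&self, prefix: &T) -> usize {
        // Take the minimum, as leapjoin picks the cheapest proposer.
        self.0.count_all(prefix).min(self.1.count_all(prefix))
    }
}

fn main() {
    // Nested tuples compose without any `dyn` objects or vtable calls.
    let leapers = (Fixed(3), (Fixed(1), Fixed(2)));
    assert_eq!(leapers.count_all(&()), 1);
}
```

Since each tuple impl is a concrete type, the compiler can inline the whole chain, which is the static resolution being discussed.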
Yeah, I mean you will be able to use it as a join/map input (probably not all antijoins) to one of the current iteration's `Variable`s — and this could definitely be useful. I was thinking about not being able to produce new tuples for the old `Variable` without calling `changed()` on it yourself in the new stratum, since the new `Iteration` won't do it for you. However, this particular case seems unlikely to ever be useful in practice, especially translating rules from datalog, and could be what you meant that it might bite folks.
I implemented and tested the tuple-based leapjoin API -- see the last commit. It proved to be about 1% more efficient (21 billion instructions vs 21.3 billion instructions before). But it's also just a touch nicer to use:
```rust
live_to_dying_regions_r2pq.from_leapjoin(
    &subset_r1p,
    (
        cfg_edge_rel.extend_with(|&((_, p), _)| p),
        region_live_at_rel.extend_with(|&((r1, _), _)| r1),
        region_live_at_rel.extend_anti(|&((_, _), r2)| r2),
    ),
    |&((r1, p), r2), &q| ((r2, p, q), r1),
);
```
instead of
```rust
live_to_dying_regions_r2pq.from_leapjoin(
    &subset_r1p,
    &mut [
        &mut cfg_edge_rel.extend_with(|&((_, p), _)| p),
        &mut region_live_at_rel.extend_with(|&((r1, _), _)| r1),
        &mut region_live_at_rel.extend_anti(|&((_, _), r2)| r2),
    ],
    |&((r1, p), r2), &q| ((r2, p, q), r1),
);
```
> It could be interesting to measure but I'm guessing Lark has more uses of this pattern than Polonius
Perhaps now, but I think this will change as we try to move more things over into polonius. e.g., I'd like to move initialization checking.
Regarding moving variables between iterations, I considered not calling `complete`. Indeed, I suppose it would work just fine. It just makes it a bit less obvious to know when you have "fixed" results that are not changing per-iteration and when you don't.
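For intuition about what `complete` actually pays for, here is a sketch: a variable accumulates a sorted batch of tuples per round, and completing it consolidates all batches into one sorted, deduplicated vector. (The real datafrog merge is more careful than this sort-and-dedup, but the cost under discussion is this consolidation, which you skip by keeping the `Variable` around un-completed.)

```rust
// Consolidate per-round sorted batches into one sorted, deduplicated vector,
// roughly what `Variable::complete` produces as a `Relation`.
fn complete(batches: Vec<Vec<u32>>) -> Vec<u32> {
    let mut merged: Vec<u32> = batches.into_iter().flatten().collect();
    merged.sort();
    merged.dedup();
    merged
}

fn main() {
    // Three rounds' worth of tuples, with some duplicates across rounds.
    let batches = vec![vec![1, 4, 7], vec![2, 4], vec![3, 7, 9]];
    assert_eq!(complete(batches), vec![1, 2, 3, 4, 7, 9]);
}
```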
Another benefit of the tuple-based API:
It enables `filter_with` and `filter_anti` leapers to be sensibly used on their own. I made them only work with an "auxiliary value" of `()`, which is implicitly proposed. You can then use them to implement intersection and difference:
e.g.,
```rust
let intersection1 = Relation::from_leapjoin(
    &input1,
    input2.filter_with(|&tuple| tuple),
    |&tuple, &()| tuple,
);
```
and
```rust
let difference1 = Relation::from_leapjoin(
    &input1,
    input2.filter_anti(|&tuple| tuple),
    |&tuple, &()| tuple,
);
```
(Of course you might also selectively use parts of the `tuple` input.)
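In plain-Rust terms, and assuming relations are sorted, deduplicated vectors as in datafrog, those two joins compute something like the following, with binary search standing in for the leaper's seek:

```rust
// `filter_with` analogue: keep tuples of `input1` that are present in `input2`.
fn intersection(input1: &[(u32, u32)], input2: &[(u32, u32)]) -> Vec<(u32, u32)> {
    input1
        .iter()
        .copied()
        .filter(|t| input2.binary_search(t).is_ok())
        .collect()
}

// `filter_anti` analogue: keep tuples of `input1` that are absent from `input2`.
fn difference(input1: &[(u32, u32)], input2: &[(u32, u32)]) -> Vec<(u32, u32)> {
    input1
        .iter()
        .copied()
        .filter(|t| input2.binary_search(t).is_err())
        .collect()
}

fn main() {
    let input1 = [(1, 1), (2, 2), (3, 3)];
    let input2 = [(2, 2), (3, 3), (4, 4)];
    assert_eq!(intersection(&input1, &input2), vec![(2, 2), (3, 3)]);
    assert_eq!(difference(&input1, &input2), vec![(1, 1)]);
}
```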
I don't see yet how you could use `extend_anti` on its own, so that is not enabled. It feels like that needs something to be "proposing" values to make sense (I suppose you could use it to encode a kind of "exists
OK, well, having used this API a bit I can't really imagine going back, so I'm inclined to merge it. =) The main question I guess is whether it's "ok" to have as many "convenience methods" as we do -- I'm inclined to say yes, and in fact I might add more (e.g., to express things like set-difference and so forth), but that can wait. I don't see a reason to push for real minimalism here, and in general having precise and simple ways for users to say things (e.g., set difference) might enable more efficient implementations in the future if we ever cared.
These additions are looking great.
I wonder if having some operators with the `logic` callback returning an `Option<Tuple>`, in addition to the ones returning a `Tuple`, would allow filtering (for example the Polonius symmetries) in an easy way, or if it would be less efficient than, say, #20.
Ah yes, I forgot about #20 -- we should land that before this, probably, and then I can rebase over it. I'm not sure I quite understand your proposal regarding `Option<Tuple>`, though, @lqd -- which method would return the `Option<Tuple>`? Or would we add that somewhere?
It seems like integrating the `PrefixFilter` would be simple enough; it can implement the `Leapers` trait in roughly the same way as `filter_with` and so forth. The `ValueFilter` should not implement `Leapers`, since it doesn't really make sense in isolation (though it could implement it in roughly the same way as `filter_with`, i.e., substituting a dummy `()` value, but it'd just be a less efficient version of `PrefixFilter`, so...).
As mentioned on Zulip (so, for folks following at home), it was another option à la

```rust
fn from_filterable_join<K: Ord, V1: Ord, V2: Ord>(
    &self,
    input1: &Variable<(K, V1)>,
    input2: &Variable<(K, V2)>,
    logic: impl FnMut(&K, &V1, &V2) -> Option<Tuple>,
) {
}
```

but it is less interesting, in every way, than `PrefixFilter` and `ValueFilter`.
I rebased over #20 and implemented `Leapers` for `PrefixFilter` now.
While doing some work using datafrog, I got annoyed about the need to create intermediate variables and things instead of being able to work directly with relations. I also noticed that `Relation::from`, which currently takes an iterator and collects to an intermediate vector, introduces a certain amount of inefficiency. (As an example, `from_leapjoin` seems to be making an extra vector on each iteration.)

This branch aims to correct both things:

- make `Relation::from` equivalent to `Relation::from_vec` instead of `Relation::from_iter`
- implement `FromIterator` for `Relation`, so you can do `let foo: Relation = iter.collect()`
- extend `Relation` with a number of methods, like `Relation::from_join`, `Relation::from_antijoin`, and `Relation::from_leapjoin`
- extend `Variable` with an `extend` method, equivalent to `.insert(Relation::from_iter(...))`
- extend `Variable::from_join` with the ability to take one argument that is a relation, or use `Relation::from_join` instead

One side effect of this change is that you can often pull computation out from the "while loop" of the iteration and instead just construct `Relation` tuples directly. It seems to me that this must be more efficient, but @frankmcsherry I'd be curious if you think I am wrong.

cc @lqd, @frankmcsherry
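As a sketch of what the `FromIterator` point enables, here is a toy stand-in `Relation` type (a sorted, deduplicated vector, which is not the real datafrog struct) for which `collect()` works directly:

```rust
// Toy relation: a sorted, deduplicated vector of tuples.
#[derive(Debug, PartialEq)]
struct Relation<T: Ord> {
    elements: Vec<T>,
}

// Implementing `FromIterator` lets `let foo: Relation<_> = iter.collect()` work.
impl<T: Ord> FromIterator<T> for Relation<T> {
    fn from_iter<I: IntoIterator<Item = T>>(iter: I) -> Self {
        let mut elements: Vec<T> = iter.into_iter().collect();
        elements.sort();
        elements.dedup();
        Relation { elements }
    }
}

fn main() {
    let foo: Relation<(u32, u32)> = [(3, 0), (1, 2), (3, 0)].into_iter().collect();
    assert_eq!(foo.elements, vec![(1, 2), (3, 0)]);
}
```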