rust-lang / datafrog

A lightweight Datalog engine in Rust
Apache License 2.0
801 stars 45 forks source link

Datalog query with leapjoin #51

Open kelped opened 7 months ago

kelped commented 7 months ago

Hey, I'm just playing around with datafrog and trying to translate some basic datalog queries to the equivalent Rust code. I am wondering what something like the following snippet looks like in datafrog?

node(1, 'A').
node(2, 'B').
node(3, 'C').
node(4, 'D').

edge(1, 2).
edge(2, 3).
edge(3, 1).
edge(1, 4).

depthTwo(W1, W2) :-
    page(T1, W1),
    page(T2, W2),
    link(T1, T2).

I'm mostly interested in leapjoin and have tried something like:

let mut iteration = Iteration::new();

let pages: Relation<_> = vec![
    (1, "super.com"),
    (2, "page.com"),
    (3, "subpage.com"),
    (4, "minpage.com"),
]
.into();
let edges: Relation<(u32, u32)> = vec![(1, 2), (2, 3), (3, 1), (1, 4)].into();
let edges_rev: Relation<_> = edges.iter().map(|&(from, to)| (to, from)).collect();

let var = iteration.variable::<(u32, &str)>("var");
var.insert(pages.clone());

while iteration.changed() {
    var.from_leapjoin(
        &var,
        // So I technically would like to filter by the edges
        // but the nodes i want to filter by are not really there yet?
        // And I can't extend with nodes and filter with the edges
        edges_rev.extend_with(|&(a, _)| a),
        //
        |&(a, b), &c| (a, b),
    );
}

But I can't quite wrap my head around how the leapers work even after reading the comment given in the source code of the repo. Any pointers would be appreciated!

IFcoltransG commented 3 weeks ago

I'm in a similar situation to you, having figured out what leapjoins do but not how to use them. For your rules the best I could come up with uses two joins:

let mut iteration = Iteration::new();

let pages: Relation<(u32, &str)> = vec![
    (1, "super.com"),
    (2, "page.com"),
    (3, "subpage.com"),
    (4, "minpage.com"),
]
.into();

let edges: Relation<(u32, u32)> = vec![(1, 2), (2, 3), (3, 1), (1, 4)].into();

let pages_source_var: Variable<(u32, &str)> = iteration.variable("pages_source_var");
pages_source_var.insert(pages.clone());

let intermediate_page_edge_var: Variable<(&str, u32)> =
    iteration.variable("intermediate_page_edge_var");

let depth_two_var: Variable<(&str, &str)> = iteration.variable("depth_two");

while iteration.changed() {
    intermediate_page_edge_var.from_leapjoin(
        &pages_source_var,
        edges.extend_with(|&(edge_start, _)| edge_start),
        |&(_, initial_page_name), &edge_end| (initial_page_name, edge_end),
    );
    depth_two_var.from_leapjoin(
        &intermediate_page_edge_var,
        pages.extend_with(|&(_, edge_end)| edge_end),
        |&(start_name, _), &end_name| (start_name, end_name),
    );
}

println!("{:?}", depth_two_var.complete().elements);
IFcoltransG commented 3 weeks ago

I've thought a bit more about it and here's the explanation I could come up with. Looking at the API for leapjoin:

fn from_leapjoin(
  &self: &Variable<Tuple>,
  source: &Variable<SourceTuple>,
  leapers: impl Leapers<SourceTuple, Val>,
  logic: impl FnMut(&SourceTuple, &Val) -> Tuple
)

&self is where the results get stored, it represents the left side of the :- in a rule. I'm going to ignore leapers for a second, and simplify the rest: from_leapjoin takes tuples from source: Variable<SourceTuple> and calls logic: Fn(SourceTuple, ???) -> Tuple to get tuples to add to self: Variable<Tuple>.

Without leapers you could use the function for simple rules, where the left hand side depends only on one variable, e.g. reachable(x, y) :- edge(x, y, cost). logic can just extract x and y. But in most cases you'll need to augment that variable's information with other information from the database -- that's where the Val parameter of logic comes in.

logic needs to calculate each entire left-hand-side of the rule using one variable (the source), and Val. Let's say in your example we pick edges as the source variable for calculating depthTwo; then Val will have to contain both strings W1 and W2. I'm going to take a look at the leapers that produce Val.

For every tuple from the source variable, each leaper proposes a set (or restricts an already-proposed set) of Vals. The intersection of all of these sets is eventually given to logic, alongside the original source tuple. If we had a leaper proposing as Val every possible pair of pages (T1, W1) and (T2, W2), then we could use other leapers to restrict that set down to the ones with a given edge between them. Firstly a filter_with (I think) that would filter out ((T1, W1), (T2, W2)) if the source tuple input isn't an edge starting from T1, and secondly one that filters those if the source tuple isn't an edge ending in T2.

depth_two_var.from_leapjoin(
  &edges_var,
  (
    todo!("something proposing all pairs of pages"),
    edges.filter_with(|&(start, _)| (start, todo!("not sure what goes here"))),
    edges_rev.filter_with(|&(_, end)| (end, todo!("not sure what goes here")))
  ),
  |_src, ((_, start_name), (_, end_name))| (start_name, end_name)
);

The biggest difficulty might be proposing ((T1, W1), (T2, W2)). The leaper constructors are roughly:

fn function_name(
  &self: &Relation<(Key, Val)>,
  key_func: Fn(&SourceTuple) -> Key
) -> impl Leaper<SourceTuple, Val>

You can see there's only one input relation, self, but in order to propose every possible pair of (T1, W1) and (T2, W2) we need to access two relations: pages, and pages again. (The fact that it's the same relation twice doesn't affect the problem; we need a Cartesian product.) So it turns out that making a leaper proposing all the pairs we need can't be done without doing a second join, a cross join, beforehand, to create the pages × pages relation.

More succinctly, any data that constructs the tuple on the left-hand-side of the rule has to either come from your chosen source variable, or needs to be from a set enumerated by at least one leaper. If we choose edges as the source, we need to enumerate every pair of pages.

Anyway, doing a cross join felt inefficient compared to the two leapjoins in my other comment.