vmware / database-stream-processor-compiler

Infrastructure to run programs written in high-level languages on top of the Database Stream Processor (DBSP) runtime.
Other
16 stars 2 forks source link

[RFC] Embedding Datalog in a functional language #2

Open ryzhyk opened 3 years ago

ryzhyk commented 3 years ago

RFC: Embedding Datalog in a functional language

This RFC sketches a Datalog dialect implemented as syntactic sugar on top of a general-purpose functional language (e.g., Rust). It strives to combine the expressiveness and flexibility of a general-purpose language with conciseness and clarity of Datalog.

LINQ and Differential Dataflow are two examples of general-purpose languages that support declarative queries over relational data. To this end, they implement all standard relational operators as higher-order functions (filter, map, join, group, etc.) This design has many important advantages:

The main downside of the approach is that it can be much more verbose than Datalog, as one must supply closures to each operator to unwrap the input record, bind its columns to local variables, and wrap the result of the operator back into a struct or tuple. Here is the same rule written in DDlog and in a Rust-like language:

relation R1(a: usize, b: usize)
relation R2(c: usize, d: usize)

// DDlog
R3(x,z) :- R1(x,y),
           R2(y,z).

// Rust
let R3 = R1.join(R2,
                 |r1| r1.b,  // extract join key from R1
                 |r2| r2.c,  // extract join key from R2
                 |r1, r2| R3{e: r1.a, f: r2.d}); // produce output record

Clearly, the second version is harder to understand and easier to get wrong.

We propose a syntactic sugar on top of the functional syntax, which eliminates this overhead while keeping the above advantages. The idea is that instead of writing closures explicitly, the programmer writes patterns that contain variable bindings, and the compiler automatically derives closures that produce and consume tuples containing bound variables. For example, the above rule would be written as:

let R3(x,z) = R1(var x, var y).join(R2(y, var z));

which translates to the Rust code above. Note that this rule is syntactically close to Datalog.

In the rest of this RFC, we define a vocabulary of relational operators and their pattern-based versions. We leave the following questions outside the scope for now:

  1. The choice of the base language. While we can define the new language from scratch (similar to DDlog-1), it can also be designed as an extension of an existing language, e.g., Rust. There are all kinds of pros and cons to both options and this choice will have major impact on the structure of the project.

  2. Support subset of the language. The exact subset of the language that can be compiled into efficient incremental code is limited by the capabilities of the compiler and the backend. For example, even though we can manipulate relations as values in the language, Differential Dataflow does not support this, so we may only be able to compile programs where all rules can be evaluated statically.

  3. Compiler architecture. Depending on the answer to question 1, the compiler will be built as an extension of an existing compiler (e.g., rustc), an embedded DSL (e.g., in GHC or C#), or a standalone compiler.

  4. Streaming features.

In the following sections we sketch the translation of various DDlog-1 elements to a pure functional language based on Rust (ok, it's really just sloppy Rust, where I ignore lifetimes and borrowing), and D-Rust, a hypothetical new language that adds syntactic shortcuts on top of Rust. These shortcuts are optional. The programmer may choose to write some or all rules using vanilla Rust.

Relation declarations

DDlog

input relation R1(a: usize, b: usize)
input relation R2(c: usize, d: usize)
output relation R3(: usize, d: usize)

Rust

struct R1 {
    a: usize,
    b: usize
}
struct R2 {
    c: usize,
    d: usize
}
struct R3 {
    e: usize,
    f: usize
}

fn main(r1: Relation<R1>, r2: Relation<R2>): Relation<R3> {
    let r3: Relation<R3> = ... ;

    r3
}

Note that DDlog combines record type declaration, relation type declaration, and relation instance declaration in a single line, while Rust requires three separate declarations. Note also that Rust doesn't require input/output annotations: input relations are relations passed as arguments to the main function, output relations are returned by the function, and intermediate relations are local variables declared inside the function. This enables modularity and reuse, as an output of one function can be used as an input to the other, which amounts to relation transformer composition.

D-Rust

TODO: I am not sure how to define meaningful shorthand syntax for relation type and instance delarations.

Relational operators

We define relational operators as methods of type Relation<>.

impl<T> Relation<T> {
    fn join<T2, K, O>(
        self, other: Relation<T2>,
        k1: fn(T) -> K,
        k2: fn(T2) -> K,
        jfun: fn(T, T2) -> O) -> Relation<O>;

    fn antijoin<T2, K>(
        self, other: Relation<T2>,
        k1: fn(T) -> K,
        k2: fn(T2) -> K) -> Relation<T>;

    fn map<O>(self,
              f: fn(T) -> O) -> Relation<O>;

    fn filter(self,
              f: fn(T) -> bool) -> Relation<T>;

    fn group<K, V>(self,
                   f: fn(T) -> (K, V)) -> Relation<Group<K, V>>;

    fn flatten<T2, I>(self,
                      f: fn(T) -> I) -> Relation<T2>
    where
        I: IntoIterator<Item=T2>;

    fn union(self, other: Relation<T>) -> Relation<T>

    // fixpoint computation that evaluates a recursive fragment of the program:
    //  starts with initial contents of recursive relations and applies `f` until reaching 
    // a fixpoint.
    fn fixpoint(init: [Relation],
                     f: Fn |[Relation]| -> [Relation] ) -> [Relation];
}

join, antijoin

DDlog

R3(x,z) :- R1(x,y),
           R2(y,z).
R4(x) :- R1(x,y),
         not R2(y, _).

Rust

fn main(r1: Relation<R1>, r2: Relation<R2>): (Relation<R3>, Relation<R4>)
{
    let r3 = r1.join(r2, |r1| r1.b, |r2| r2.c, |r1, r2| R3{e: r1.a, f: r2.d});
    let r4 = r1.antijoin(r2, |r1| r1.b, |r2| r2.c)
               .map(|r1| R4{g: r1.a});

    (r3, r4)
}

D-Rust

fn main(r1: Relation, r2: Relation): (Relation, Relation) { let r3 = r1(var x, var y).join(r2(y, var z)) -> R3(x, z); let r4 = r1(var x, var y).antijoin(r2(y)) -> R4(x);

(r3, r4)

}

map, filter

DDlog

R3(x,z) :- R1(x,y),
           x > 0,
           w = x + y,
           R2(w,z).

Rust

let r3 = r1.
         filter(|r1| r1.a > 0).
         map(|r1| (r1.a, r1.b, x + y)).
         join(r2, |(x,y,w)| w, |r2| r2.c, |(x,y,w), r2| R3{e: x, f: r2.d});

D-Rust

let r3 = r1(var x, var y).
         filter(x>0).
         bind(var w = x+y).
         join(r2(w, var z)) -> R3{e: x, f: z};

Here I use bind as a more intuitive synonym for map.

group_by

DDlog

R2(y, m) :- R1(x,y),
            var m = x.group_by(y).min().

Rust

let r2 = r1.map(|r1| (r1.b, r1.a))
           .group()
           .map(|g| R2{c: g.key(), d: g.min()});

D-Rust

let r2 = r1(var x, var y).
         project(x).
         bind(var g = group_by(y)) -> R2{c: y, d: g.min()};

flatten

DDlog

relation R1(x: usize, ys: Vec<usize>)

R2(x,y) :- R1(x, ys),
           var y = FlatMap(ys).

Rust

let r2 = r1.map(|r1| r1.ys.map(|y| (r1.x, y))).flatten().map(|(x,y)| R2{x,y});

D-Rust

let r2 = r1(x, ys).bind(var y = ys.flatten()) -> R2{x, y};

Union, recursion

DDlog

Path(x, y) :- Edge(x,y).
Path(x, z) :- Path(x, w), Edge(w, z).

Rust

let path = Relation::fixpoint(
    [edge.map(|e| Path{e.from, e.to})],
    |[path]| [path.union(path.join(edge, |p| p.to, |e| e.from, |p, e| Path{p.from, e.to}))]);

D-Rust

We may want the compiler to detect recursive components automatically (Datalog-style). Although there is also a lot to be said for forcing the programmer to be explicit about recursion.

// initial assignment
let path = edge(var x, var y) -> Path{x, y};
// recursive step
path += path(var x, var y).join(edge(y, var z)) -> Path{x, z};
ryzhyk commented 3 years ago

Summary of discussion with @Kixiron:

Kixiron commented 3 years ago

Additionally, I think the first-class relations aspect of DDlog is a really promising avenue for both simplifying the language as well as making it more powerful and approachable

ryzhyk commented 3 years ago

Additionally, I think the first-class relations aspect of DDlog is a really promising avenue for both simplifying the language as well as making it more powerful and approachable

Yep, that's part of the appeal. In a way it makes the language a bit too powerful, as we can now write programs that our backend does not support, but that's a much lesser evil.

Kixiron commented 3 years ago

I talked to some ddlog users and they listed a few things

ryzhyk commented 3 years ago

They want it to be well optimized, the user shouldn't have to think about speed, just about writing pure logic

We certainly need better optimizers for both declarative and imperative code. One thing I've learned from working with DDlog though is that oftentimes you can only get performance through macro-optimizations involving code refactoring that you cannot possibly expect the compiler to figure out. I'm just saying, realistically performance will always be a programmer's concern to a large extent.

Completely agree with all other points. Also, you clearly know about more DDlog users than I do :)

ryzhyk commented 3 years ago

I talked to some ddlog users and they listed a few things

Let me move this comment to #1

mihaibudiu commented 3 years ago

They want it to be well optimized, the user shouldn't have to think about speed, just about writing pure logic

We certainly need better optimizers for both declarative and imperative code. One thing I've learned from working with DDlog though is that oftentimes you can only get performance through macro-optimizations involving code refactoring that you cannot possibly expect the compiler to figure out. I'm just saying, realistically performance will always be a programmer's concern to a large extent.

This is because often the optimal program depends on the data that you are feeding the program. The compiler has no way of knowing that. In some remote day perhaps the program runtime will monitor its own execution and observe and report inefficiencies.

ryzhyk commented 3 years ago

This is because often the optimal program depends on the data that you are feeding the program. The compiler has no way of knowing that. In some remote day perhaps the program runtime will monitor its own execution and observe and report inefficiencies.

That's part of the problem, but profile-driven optimization aside, macro-optimizations are just really hard. You are asking the compiler to synthesize an equivalent program that is more optimal in terms of some metric. This is undecidable in theory and in practice.

mihaibudiu commented 3 years ago

Perhaps we should write these RFCs as rst files instead. Would make it easier to comment and to convert into documentation later.

ryzhyk commented 3 years ago

Perhaps we should write these RFCs as rst files instead. Would make it easier to comment and to convert into documentation later.

Does github support .rst in issues? I don't think it does.

mihaibudiu commented 3 years ago

I am suggesting contributing the RFC as a document in a docs directory that we can then review using normal github tools.

Kixiron commented 3 years ago

I would much prefer markdown over rst

ryzhyk commented 3 years ago

There is only one way to settle this: https://www.rpsgame.org/

Kixiron commented 3 years ago

Markdown integrates super easily with mdbook which would allow us one unified site for documentation and rfcs

mihaibudiu commented 3 years ago

md is fine, the point is to have a document instead of an issue.

amytai commented 3 years ago

Is DDlog-2 going to have original DDlog-1 syntax, or are we verging more towards the D-Rust sample language in this RFC? After inspecting LINQ, I have to say that the explicit function calls (filter, map, etc) are more accessible than the sequence of (implicit) operators in DDlog. D-Rust seems to be a nice in-between that isn't as verbose as Rust but isn't as opaque as DDlog.

Having said that, I haven't talked with many DDlog users, so maybe users don't actually find this to be a problem, in which case we stick with original DDlog-1 syntax.

ryzhyk commented 3 years ago

@amytai, the consensus seems to be that we're moving in the D-Rust direction, specifically:

Kixiron commented 3 years ago

I like to look at it more from the angle of treating relations as first-class objects within the language that works in tandem with the more traditional Datalog syntax so that you can switch between writing imperative code and declarative logic over relations

function join_neighbors(nodes: Relation[(Node, NodeData)], edges: Relation[(Node, Node)]): Relation[(NodeData, NodeData)] {
    nodes
        .join(edges)
        .map(|(_src, (data, dest))| (dest, data))
        .join(edges)
        .map(|(_dest, (src_data, dest_data))| (src_data, dest_data))
}

output relation Neighbor(src: NodeData, dest: NodeData)
Neighbor(src, dest) :- let (src, dest) = join_neighbors(Nodes, Edges).
ryzhyk commented 3 years ago

I think we're an agreement. But I think it is nice that relational stuff can be expressed in the purely functional subset of the language after desugaring (even though the compiler will handle functions on relations differently).

ryzhyk commented 3 years ago

PS. In the join_neighbors example there's no reason one wouldn't be allowed to use Datalog syntax inside the function. Likewise, one should be able to write the Neighbor rule in a purely functional syntax.

mihaibudiu commented 3 years ago

Clearly, the second version is harder to understand and easier to get wrong.

I disagree with this statement. It is harder to understand because the column names are meaningless. I think that the join in Datalog is in fact easier to get wrong; if you transpose the column names in some relations no one will even notice.

I believe that the LINQ-like syntax is great. If there is a systematic way to translate Datalog into LINQ we should do it. Do we allow rules that mix the two syntaxes in a single rule? I propose to start with just the LINQ syntax and add the Datalog layer later, as a front-end rewriting pass. The rewriting rules will become part of the documentation of the semantics of Datalog. The semantics of LINQ is much easier to explain.

mihaibudiu commented 3 years ago

Speaking of this, I would very much like a design of the compiler that makes is easy to insert passes like this - desugaring passes, optimization passes. This makes the compiler extensibility and incremental development much simpler. Perhaps we should start a new issue about the compiler architecture?

ryzhyk commented 3 years ago

I agree, we should start with the general-purpose subset. It's not really LINQ though, just a general-purpose functional language.

It is not true in my opinion that LINQ is always better than Datalog. Long sequences of joins are tedious and error-prone. Like Datalog, LINQ does not enforce consistent naming of columns, but also forces you to manually form tuples and then manually bind tuple fields to variables.

amytai commented 3 years ago

PS. In the join_neighbors example there's no reason one wouldn't be allowed to use Datalog syntax inside the function.

Ah cool I hadn't thought of this. So then Datalog-syntax will be syntactic sugar that should be expressible via #3 and common LINQ-style operators such as join, filter will also (?) be syntactic sugar (i.e. non-UDF relational operators exposed via some std lib).

mihaibudiu commented 3 years ago

Yes, that's right, the canonical syntax is the LINQ-style, since it is unambiguous. The others are just syntactic sugar. We already have a precedent with the FLWOR-flavored loops in DDlog, which are converted to DDlog right after parsing.