vmware / differential-datalog

DDlog is a programming language for incremental computation. It is well suited for writing programs that continuously update their output in response to input changes. A DDlog programmer does not write incremental algorithms; instead they specify the desired input-output mapping in a declarative manner.
MIT License
1.38k stars 119 forks source link

A running time question #1014

Open helloqirun opened 3 years ago

helloqirun commented 3 years ago

Hi,

I have a quick question about the ddlog performance. Please see the following.

I have a .dl file that contains 4000+ lines. But the .dat file is fairly small. Please see the test files at https://github.com/helloqirun/ddlog_test

The timestamp is very small (0.21 seconds), as expected. However, the overall running time for dyck_cil is a bit long (~19 seconds).

$ ddlog -v
DDlog v0.41.0 (cadc3f1a66b03766049ab125ea5dddae19e537ba)
Copyright (c) 2019-2020 VMware, Inc. (MIT License)

$ time dyck_ddlog/target/release/dyck_cli < input.dat
Timestamp: 214690070

real    0m18.992s
user    0m24.924s
sys 0m7.078s

Is the long-running time for dyck_cil an expected behavior? Thanks!

ryzhyk commented 3 years ago

@helloqirun , thanks for reporting the issue. That the timestamp is small suggests that ddlog spends this entire time instantiating the enormous dataflow graph with thousands of rules and relations. In addition, 0.21s is a lot of time to perform a trivial computation, which likely means that differential dataflow also struggles with this huge graph at runtime.

I see that @Kixiron self-assigned this issue, so he may have a more detailed explanation soon.

Just curious: how did you come up with this program? It looks like auto-generated code :)

helloqirun commented 3 years ago

Thanks @ryzhyk and @Kixiron ! Yes, the .dl file is auto-generated. It essentially contains a Dyck language of 1600+ parentheses, defined by the grammar "S-> SS | (_i S )_i | empty". I was using souffle and switched to ddlog because it supports incremental updates.

ryzhyk commented 3 years ago

I am not familiar with dyck, but here is a equivalent program that adds an idx column to A and B to model 1600 individual relations used in your code. This should be way more efficient.

typedef node = u32

input relation RA(idx: u16, s: node, t: node)
input relation RB(idx: u16, s: node, t: node)

input relation EMPTY(s: node, t: node)

output relation S(s: node, t: node)

S(x, y) :- EMPTY(x, y).
S(x, y) :- S(x,z), S(z,y).
S(x, z) :- RA(i, x, a), S(a, b), RB(i, b, z).
Kixiron commented 3 years ago

This is a really interesting case that I'm not fully done investigating so I'll update whenever I finish looking into it. From what I can see now, this looks like a really interesting and strangely pathological case due to arrangements, there's hundreds of thousands of arrangement-related events produced by the code but very few timely events which is super interesting. Again, not finished looking into things so I'll update y'all when I have more info

helloqirun commented 3 years ago

I am not familiar with dyck, but here is a equivalent program that adds an idx column to A and B to model 1600 individual relations used in your code. This should be way more efficient.

typedef node = u32

input relation RA(idx: u16, s: node, t: node)
input relation RB(idx: u16, s: node, t: node)

input relation EMPTY(s: node, t: node)

output relation S(s: node, t: node)

S(x, y) :- EMPTY(x, y).
S(x, y) :- S(x,z), S(z,y).
S(x, z) :- RA(i, x, a), S(a, b), RB(i, b, z).

Thanks. This one runs much faster!