vmware-archive / database-stream-processor

Streaming and Incremental Computation Framework
Other
220 stars 19 forks source link

Comparison between DBSP engine and differential dataflow #365

Open nooberfsh opened 1 year ago

nooberfsh commented 1 year ago

Hi, I'm very interested in incremental computation and I have learned dd for a while. I found dbsp some days ago and it seems can do the same things as dd. What's differences between them, when should we use dbsp?

mihaibudiu commented 1 year ago

This is a complicated question because it can be interpreted in many ways and the answer really depends on your needs.

DD is an open-source library, but, as you probably know, it is the core of the commercial Materialize.com offering. I don't know much about that product, so I won't comment about it. My answer applies to the open-source side.

The DBSP paper https://github.com/vmware/database-stream-processor/blob/main/doc/vldb23/main.pdf does have a short comparison with DD, but that is really about the theoretical side. I quote from Section 9:

DBSP is also related to Differential Dataflow (DD) [39, 42] and its theoretical foundations [3] (and recently [17, 38]). DD’s compu- tational model is more powerful than DBSP, since it allows time values to be part of an arbitrary lattice. In fact, DD is the only other framework which we are aware of that can incremental- ize recursive queries as efficiently as DBSP does. In contrast, our model uses either “linear” times, or nested time dimensions via the modular lifting transformer (↑). DBSP can express both incremen- tal and non-incremental computations. Most importantly, DBSP comes with Algorithm 4.6, a syntax-directed translation that can convert any expressible query into an incremental version — in DD users have to assemble incremental queries manually using incre- mental operators. (materialize.com offers a product that automates incrementalization for SQL queries based on DD. Differential Data- log [45] does it for a Datalog dialect.) Unlike DD, DBSP is a modular theory, which easily accommodates the addition of new operators: as long as we can express a new operator as a DBSP circuit, we can (1) define its incremental version, (2) apply the incrementalization algorithm to obtain an efficient incremental implementation, and (3) be confident that it composes with any other operators.

In practical terms, in our benchmarks we are slightly faster than DD (10-20%). We also provide an open-source SQL compiler on top of DBSP (https://github.com/vmware/sql-to-dbsp-compiler). We do believe that the modularity of our theory does allow easier extensions, and we have added some operators that DD does not have (e.g., Window-based aggregation).

In terms of debugging, DD has been around for a long time, so supposedly it is more tested. However, the core logic of our code is simpler; these lattice-based times are not easy to understand. Our main weakness right now may be documentation, which we are planning to improve significantly.

I will be glad to give a more precise answer if you can elaborate on your use cases.

mihaibudiu commented 1 year ago

Let me add though, that if you are interested in incremental computation in general you must read our paper even if you don't plan to use the software. (Or you can watch the video in our README for a simplified description.) It provides a new perspective on this problem, which we believe opens up many new interesting theoretical and practical directions.