vmware / database-stream-processor-compiler

Infrastructure to run programs written in high-level languages on top of the Database Stream Processor (DBSP) runtime.

[RFC] Interacting with DDlog programs #11

Open mihaibudiu opened 2 years ago

mihaibudiu commented 2 years ago

Interacting with DDlog programs

A DDlog program is compiled into a library. The task of sending data to and receiving data from DDlog programs is left to applications built on top.

Interaction with DDlog programs can happen either through the CLI or through API calls. CLI operations are in fact interpreted by the CLI tool and converted into API calls. The semantics of these API calls is currently not well specified.

Here are things that should be clearly specified:

mihaibudiu commented 2 years ago

This is related to https://github.com/vmware/differential-datalog/issues/372

ryzhyk commented 2 years ago

Some more thoughts to add to this list:

mihaibudiu commented 2 years ago

DDlog is a streaming database, but by adding indexes and their lookup APIs, and primary keys, we convert it into something that is more similar to a traditional DB. But it is not really a traditional DB, so people who expect it to behave as one may be surprised.

ryzhyk commented 2 years ago

> add a distinct after each input relation. Then each delta is by definition correct (modulo primary keys).

This won't work: insert->insert->delete behaves like an insert, whereas the "expected" behavior (assuming the user expects set semantics) is a no-op.
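
To make this concrete, here is a minimal, self-contained sketch of the objection; the `ZSet`, `apply`, and `distinct` names are invented for illustration and are not the DDlog/DBSP API. It shows why a distinct over weighted input updates does not give the set semantics a user may expect for insert->insert->delete:

```rust
// Hypothetical illustration, not the DDlog/DBSP API: an input relation is
// modeled as a Z-set, i.e. a map from record to signed weight.
use std::collections::HashMap;

type ZSet = HashMap<&'static str, i64>;

fn apply(zset: &mut ZSet, record: &'static str, weight: i64) {
    *zset.entry(record).or_insert(0) += weight;
}

// `distinct` maps every positive weight to 1 and drops the rest.
fn distinct(zset: &ZSet) -> ZSet {
    let mut out = ZSet::new();
    for (&record, &weight) in zset {
        if weight > 0 {
            out.insert(record, 1);
        }
    }
    out
}

fn main() {
    let mut input = ZSet::new();
    apply(&mut input, "x", 1);  // insert x
    apply(&mut input, "x", 1);  // insert x again
    apply(&mut input, "x", -1); // delete x

    // The multiset weight of x is still 1, so distinct reports x as present...
    assert_eq!(distinct(&input).get("x"), Some(&1));
    // ...whereas under set semantics the second insert would be a no-op and
    // the delete would remove x entirely.
}
```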

> enforce no invariants and assume that the users' updates are all legal (e.g., because they come from a real DB or another DDlog instance, which enforces the invariants itself).

From experience, many users struggle with this, which is why the current semantics was introduced in the first place. This is the same issue that Frank McSherry discusses in his upsert blog post.

> specify a preprocessing module logically before DDlog, which maintains the input table and rejects illegal updates

This preprocessing module will end up maintaining a private snapshot of the input state, as we do now. Upserts avoid this overhead.

mihaibudiu commented 2 years ago

If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).

ryzhyk commented 2 years ago

The problem with your proposal about indicating which updates fail is that the updates actually don't commute. So skipping some will give a completely different semantics to the update set.

It is up to the user to roll back a transaction after a failed update.
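
To illustrate why skipping failed updates changes the meaning of a batch, here is a small hypothetical sketch (invented names, not the DDlog API) of a keyed table where inserts that violate the primary key are silently skipped; the same set of updates then produces different states depending on order:

```rust
// Hypothetical sketch, not the DDlog API: a keyed input table where inserting
// an existing primary key fails and is silently skipped.
use std::collections::HashMap;

type Table = HashMap<&'static str, i64>;

// Apply a batch of (key, value) inserts, skipping any insert whose key is
// already present (a primary-key violation).
fn apply_skipping(table: &mut Table, batch: &[(&'static str, i64)]) {
    for &(k, v) in batch {
        table.entry(k).or_insert(v);
    }
}

fn main() {
    let mut t1 = Table::new();
    let mut t2 = Table::new();

    // The same two updates, applied in different orders:
    apply_skipping(&mut t1, &[("k", 1), ("k", 2)]);
    apply_skipping(&mut t2, &[("k", 2), ("k", 1)]);

    // Skipping the failed insert gives order-dependent results.
    assert_eq!(t1.get("k"), Some(&1));
    assert_eq!(t2.get("k"), Some(&2));
}
```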

mihaibudiu commented 2 years ago

If all failed transactions are supposed to be rolled back, why not do it automatically?

ryzhyk commented 2 years ago

> If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).

Not sure I understand. How will apply(delta) solve the insert->insert->delete problem? Maybe I don't understand what apply(delta) means.

mihaibudiu commented 2 years ago

And if they are not supposed to be rolled back, the state of the DB after a failed transaction should be clearly defined.

ryzhyk commented 2 years ago

> If all failed transactions are supposed to be rolled back, why not do it automatically?

They are not. It's up to the client. And yes, the state needs to be clearly defined.

mihaibudiu commented 2 years ago

> If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).

> Not sure I understand. How will apply(delta) solve the insert->insert->delete problem? Maybe I don't understand what apply(delta) means.

By essentially defining the semantics of an update in this way: take a delta, add it to the input table, and apply a distinct. It is not a traditional DB view, but it is clear.

ryzhyk commented 2 years ago

> If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).

> Not sure I understand. How will apply(delta) solve the insert->insert->delete problem? Maybe I don't understand what apply(delta) means.

> By essentially defining the semantics of an update in this way: take a delta, add it to the input table, and apply a distinct. It is not a traditional DB view, but it is clear.

I see. This still doesn't solve the insert->insert->delete problem though if each operation happens in a separate transaction.

Kixiron commented 2 years ago

We could draw inspiration from Materialize: the way it handles internal or user-produced errors is by producing parallel error tables (read: relations) for outputs, allowing it to process errors incrementally (and to fix them incrementally as well). The basic structure is that one relation is filled with Ok values and one with Errs; if the Err table is non-empty, then the Ok table has an indeterminate value. Ideally this could also allow rollback to happen if engineered correctly, and it should automatically address insert->insert->delete, since it would maintain weights like DD does normally.
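
A hypothetical sketch of this parallel error-table idea (all names are invented; this is neither Materialize's nor DBSP's API): each output is paired with an incrementally maintained error relation, and the Ok side is only considered valid while the error side is empty:

```rust
use std::collections::HashMap;

struct Output<T> {
    ok: HashMap<T, i64>,       // record -> weight
    err: HashMap<String, i64>, // error message -> weight
}

impl<T: std::hash::Hash + Eq> Output<T> {
    fn new() -> Self {
        Output { ok: HashMap::new(), err: HashMap::new() }
    }

    // Route each incoming weighted change into the Ok or Err relation.
    fn push(&mut self, change: Result<T, String>, weight: i64) {
        match change {
            Ok(record) => *self.ok.entry(record).or_insert(0) += weight,
            Err(msg) => *self.err.entry(msg).or_insert(0) += weight,
        }
    }

    // The Ok relation is only meaningful while the Err relation is empty.
    fn is_valid(&self) -> bool {
        self.err.values().all(|&w| w == 0)
    }
}

fn main() {
    let mut out = Output::new();
    out.push(Ok("row"), 1);
    // A constraint violation lands in the error relation instead of
    // silently corrupting the Ok relation.
    out.push(Err("constraint violated".to_string()), 1);
    assert!(!out.is_valid());

    // Retracting the offending input retracts the error, restoring validity.
    out.push(Err("constraint violated".to_string()), -1);
    assert!(out.is_valid());
}
```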

ryzhyk commented 2 years ago

Yeah, status tables are a nice way to report errors incrementally, especially if we want to support a larger class of consistency constraints.

The problem with insert->insert->delete though is that we don't want to maintain weights in the normal way. Clients that rely on the upsert semantics expect the second insert to be a no-op, so that the last delete destroys the record. This fundamentally requires keeping a snapshot of the input tables, and DD's upsert operator should do this reasonably efficiently.
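
For concreteness, here is a minimal sketch of that upsert semantics under these assumptions (invented names, not DD's actual upsert operator): a snapshot of the input table is kept so that a redundant insert becomes a no-op and the emitted deltas reflect only genuine changes to the set:

```rust
use std::collections::HashMap;

type Key = &'static str;
type Value = i64;

struct UpsertInput {
    // The retained snapshot of the input table: key -> current value.
    snapshot: HashMap<Key, Value>,
}

impl UpsertInput {
    fn new() -> Self {
        UpsertInput { snapshot: HashMap::new() }
    }

    // `Some(v)` inserts or overwrites, `None` deletes. Returns the weighted
    // delta ((key, value), weight) that the downstream dataflow should see.
    fn upsert(&mut self, key: Key, value: Option<Value>) -> Vec<((Key, Value), i64)> {
        let mut delta = Vec::new();
        let old = self.snapshot.get(key).copied();
        match (old, value) {
            (Some(o), Some(n)) if o == n => {} // redundant insert: no-op
            (Some(o), Some(n)) => {
                delta.push(((key, o), -1));
                delta.push(((key, n), 1));
                self.snapshot.insert(key, n);
            }
            (None, Some(n)) => {
                delta.push(((key, n), 1));
                self.snapshot.insert(key, n);
            }
            (Some(o), None) => {
                delta.push(((key, o), -1));
                self.snapshot.remove(key);
            }
            (None, None) => {} // deleting a missing key: no-op
        }
        delta
    }
}

fn main() {
    let mut input = UpsertInput::new();

    assert_eq!(input.upsert("x", Some(1)), vec![(("x", 1), 1)]); // first insert emits +1
    assert!(input.upsert("x", Some(1)).is_empty());              // second insert is a no-op
    assert_eq!(input.upsert("x", None), vec![(("x", 1), -1)]);   // the delete destroys the record
    assert!(input.snapshot.is_empty());
}
```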

mihaibudiu commented 2 years ago

Reading this paper:

> McSherry, Frank; Lattuada, Andrea; Schwarzkopf, Malte; Roscoe, Timothy. "Shared Arrangements: Practical Inter-Query Sharing for Streaming Dataflows." PVLDB, Vol. 13. http://www.vldb.org/pvldb/vol13/p1793-mcsherry.pdf

suggests an interesting solution: shared arrangements for all inputs. This could be a compilation option. This solution could provide several benefits:

In this model a DDlog computation is really a two-stage process: input arrangements followed by the actual dataflow graph. The input arrangements have a rich API on both sides: to the outside to perform transactions, and to the inside to supply deltas and to replay data. The dataflow graph inside has a simple API, where everything is a delta.
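
A hypothetical sketch of that two-stage structure (all names invented, not the DDlog/DBSP API): an input arrangement exposes a transactional API to the outside while handing plain weighted deltas to the dataflow, plus a replay hook for the inside:

```rust
use std::collections::HashMap;

type Record = &'static str;
type Delta = Vec<(Record, i64)>; // (record, weight)

struct InputArrangement {
    contents: HashMap<Record, i64>, // committed state: record -> weight
    pending: Delta,                 // uncommitted updates of the open transaction
}

impl InputArrangement {
    fn new() -> Self {
        InputArrangement { contents: HashMap::new(), pending: Vec::new() }
    }

    // Outward-facing transactional API.
    fn insert(&mut self, r: Record) { self.pending.push((r, 1)); }
    fn delete(&mut self, r: Record) { self.pending.push((r, -1)); }

    // Commit the open transaction: update the arrangement and hand the delta
    // to the dataflow (represented here by a callback).
    fn commit(&mut self, dataflow: &mut impl FnMut(Delta)) {
        for &(r, w) in &self.pending {
            *self.contents.entry(r).or_insert(0) += w;
        }
        let delta = std::mem::take(&mut self.pending);
        dataflow(delta);
    }

    // Inward-facing API: replay the committed contents as one big delta,
    // e.g. to bring a newly attached computation up to date.
    fn replay(&self, dataflow: &mut impl FnMut(Delta)) {
        let snapshot: Delta = self.contents.iter().map(|(&r, &w)| (r, w)).collect();
        dataflow(snapshot);
    }
}

fn main() {
    let mut edges = InputArrangement::new();
    let mut received: Vec<Delta> = Vec::new();
    let mut dataflow = |delta: Delta| received.push(delta);

    edges.insert("a -> b");
    edges.insert("b -> c");
    edges.commit(&mut dataflow); // the dataflow only ever sees deltas

    edges.replay(&mut dataflow); // a late consumer can be brought up to date
    assert_eq!(received.len(), 2);
}
```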

Kixiron commented 2 years ago

That's pretty much what the upsert approach that Leon is talking about amounts to.

mihaibudiu commented 2 years ago

I thought about this more and I hope I have a design. It's not final, but I hope it clarifies some dimensions. I will write a document about it.