mihaibudiu opened this issue 3 years ago
This is related to https://github.com/vmware/differential-datalog/issues/372
Some more thoughts to add to this list:
- The `apply_changes()` method never fails and returns lists of successfully and unsuccessfully applied updates, rather than a single status flag. Or perhaps the API should allow the caller to choose failure semantics for each invocation (ignore duplicates or fail when encountering a duplicate). A hypothetical sketch of such an API follows this list.
- Use the `upsert` operator instead. In fact this is something we should fix in DDlog-1 too: https://github.com/vmware/differential-datalog/issues/1101.
- DDlog is a streaming database, but by adding indexes and their lookup APIs, and primary keys, we convert it into something that is more similar to a traditional DB. But it is not really a traditional DB, so people who expect it to behave as one may be surprised.
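Purely as a sketch of the design space (none of these names exist in DDlog today; the real API may end up looking quite different), the first bullet could look roughly like this in Rust:

```rust
// Hypothetical sketch only: none of these names are part of the DDlog API.
// It illustrates an apply_changes() that never fails outright and instead
// reports which updates were applied and which were rejected, with a
// per-invocation policy for duplicates.

#[derive(Clone, Debug)]
pub enum Update<R> {
    Insert(R),
    Delete(R),
}

#[derive(Clone, Copy, Debug)]
pub enum FailurePolicy {
    /// Silently drop duplicate inserts / deletes of missing records.
    IgnoreDuplicates,
    /// Reject the offending update, but keep applying the rest.
    FailOnDuplicate,
}

#[derive(Debug)]
pub struct ChangeReport<R> {
    pub applied: Vec<Update<R>>,
    /// Each rejected update together with a reason.
    pub rejected: Vec<(Update<R>, String)>,
}

pub trait InputRelation<R> {
    /// Never fails outright; callers inspect the report instead of a status flag.
    fn apply_changes(
        &mut self,
        updates: Vec<Update<R>>,
        policy: FailurePolicy,
    ) -> ChangeReport<R>;
}
```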
> add a distinct after each input relation. Then each delta is by definition correct (modulo primary keys).
This won't work, as insert->insert->delete behaves like an insert, whereas the "expected" behavior (assuming the user expects set semantics) is a no-op.
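A minimal plain-Rust illustration of the mismatch (not DD code): under weight semantics the record ends up with weight +1 and therefore still exists, while under the set semantics a user might expect, the delete removes it.

```rust
use std::collections::{HashMap, HashSet};

fn main() {
    // Weight (multiset) semantics, as DD maintains them:
    // insert(+1), insert(+1), delete(-1) leaves "x" with weight +1.
    let mut weights: HashMap<&str, i64> = HashMap::new();
    for (rec, w) in [("x", 1), ("x", 1), ("x", -1)] {
        *weights.entry(rec).or_insert(0) += w;
    }
    assert_eq!(weights["x"], 1); // the record still exists

    // Set semantics that a database user may expect:
    // the second insert is a no-op, so the delete removes "x" entirely.
    let mut set: HashSet<&str> = HashSet::new();
    for (rec, w) in [("x", 1), ("x", 1), ("x", -1)] {
        if w > 0 { set.insert(rec); } else { set.remove(rec); }
    }
    assert!(!set.contains("x"));
}
```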
> enforce no invariants and assume that the users' updates are all legal (e.g., because they come from a real DB or another DDlog instance, which enforces the invariants itself).
From experience, many users struggle with this, which is why the current semantics was introduced in the first place. This is the same issue that Frank talks about in the upsert blog.
> specify a preprocessing module logically before DDlog, which maintains the input table and rejects illegal updates
This preprocessing module will end up maintaining a private snapshot of the input state, as we do now. Upserts avoid this overhead.
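A rough sketch of what such a preprocessing module would have to maintain, assuming simple set semantics for a single input relation; all names are made up for illustration:

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical "preprocessing module": it keeps its own snapshot of the input
// relation and rejects updates that violate set semantics. This snapshot is
// exactly the state that DD's upsert machinery could maintain for us instead.
struct Preprocessor<R: Hash + Eq> {
    snapshot: HashSet<R>,
}

impl<R: Hash + Eq> Preprocessor<R> {
    fn new() -> Self {
        Preprocessor { snapshot: HashSet::new() }
    }

    /// Reject duplicate inserts instead of letting them reach the dataflow.
    fn insert(&mut self, rec: R) -> Result<(), String> {
        if self.snapshot.insert(rec) {
            Ok(())
        } else {
            Err("duplicate insert".to_string())
        }
    }

    /// Reject deletes of records that are not present.
    fn delete(&mut self, rec: &R) -> Result<(), String> {
        if self.snapshot.remove(rec) {
            Ok(())
        } else {
            Err("delete of a missing record".to_string())
        }
    }
}
```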
If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).
The problem with your proposal about indicating which updates fail is that the updates actually don't commute. So skipping some will give a completely different semantics to the update set.
It is up to the user to roll back a transaction after a failed update.
If all failed transactions are supposed to be rolled back, why not do it automatically?
> If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).
Not sure I understand. How will apply(delta) solve the insert->insert->delete problem? Maybe I don't understand what apply(delta) means.
And if they are not supposed to be rolled back, the state of the DB after a failed transaction should be clearly defined.
> If all failed transactions are supposed to be rolled back, why not do it automatically?
They are not. It's up to the client. And yes, the state needs to be clearly defined.
> If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).
>
> Not sure I understand. How will apply(delta) solve the insert->insert->delete problem? Maybe I don't understand what apply(delta) means.
By essentially defining the semantics of an update in this way: take a delta, add it to the input table, and apply a distinct. It is not a traditional DB view, but it is clear.
> If you have the apply(delta) API then the distinct will work fine (but will not handle the primary keys).
>
> Not sure I understand. How will apply(delta) solve the insert->insert->delete problem? Maybe I don't understand what apply(delta) means.
>
> By essentially defining the semantics of an update in this way: take a delta, add it to the input table, and apply a distinct. It is not a traditional DB view, but it is clear.
I see. This still doesn't solve the insert->insert->delete problem though if each operation happens in a separate transaction.
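To make the objection concrete, here is a plain-Rust model (not DDlog or DD code) of the proposed semantics: each transaction adds its delta to a weighted input table, and the visible output is the distinct of that table. Running insert, insert, delete as three separate transactions leaves the record visible, because its weight settles at +1:

```rust
use std::collections::HashMap;

fn main() {
    // Weighted input table maintained across transactions.
    let mut input: HashMap<&str, i64> = HashMap::new();
    // Three separate transactions: insert, insert, delete of the same record.
    let transactions = [vec![("x", 1)], vec![("x", 1)], vec![("x", -1)]];

    for delta in transactions {
        // Apply the delta to the input table.
        for (rec, w) in delta {
            *input.entry(rec).or_insert(0) += w;
        }
        // The visible output is distinct(input): every record with weight > 0.
        let output: Vec<&str> = input
            .iter()
            .filter(|(_, &w)| w > 0)
            .map(|(&rec, _)| rec)
            .collect();
        println!("output after this transaction: {:?}", output);
    }

    // The record is still visible: its weight is +1, not 0.
    assert_eq!(input["x"], 1);
}
```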
We could draw inspiration from Materialize: the way it handles internal or user-produced errors is by producing parallel error tables (read: relations) for outputs, allowing it to incrementally process errors (and to incrementally fix them as well). The basic structure is that one relation is filled with `Ok` values and one with `Err`s; if the `Err` table is non-empty, then the `Ok` table has an indeterminate value. Ideally this could also allow for rollback to happen if engineered correctly, and it should automatically address insert->insert->delete since it'd maintain weights like DD does normally.
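A toy plain-Rust illustration of the pattern described above (not Materialize or DDlog code): each change lands either in an ok table or in an err table, and the ok table is treated as meaningful only while the err table is empty.

```rust
// Validation stands in for whatever consistency check the program enforces.
fn validate(rec: &str) -> Result<String, String> {
    if rec.is_empty() {
        Err("empty record".to_string())
    } else {
        Ok(rec.to_uppercase())
    }
}

fn main() {
    // Two parallel "tables": good results and errors, maintained incrementally.
    let mut ok_table: Vec<String> = Vec::new();
    let mut err_table: Vec<String> = Vec::new();

    for rec in ["a", "", "b"] {
        match validate(rec) {
            Ok(v) => ok_table.push(v),
            Err(e) => err_table.push(e),
        }
        if err_table.is_empty() {
            println!("ok table: {:?}", ok_table);
        } else {
            println!("errors present, ok table is indeterminate: {:?}", err_table);
        }
    }
}
```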
Yeah, status tables are a nice way to report errors incrementally, especially if we want to support a larger class of consistency constraints.
The problem with insert->insert->delete, though, is that we don't want to maintain weights in the normal way. Clients that rely on the `upsert` semantics expect the second insert to be a no-op, so that the last `delete` destroys the record. This fundamentally requires keeping a snapshot of the input tables, and DD's `upsert` operator should do this reasonably efficiently.
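A small model of the intended upsert semantics at the input boundary; this only illustrates the behavior, not how DD's operator is implemented:

```rust
use std::collections::HashMap;

fn main() {
    // The maintained snapshot of the keyed input relation.
    let mut snapshot: HashMap<&str, &str> = HashMap::new();
    // Keyed updates: Some(v) means "set key to v", None means "delete key".
    let updates = [("k1", Some("v1")), ("k1", Some("v1")), ("k1", None)];

    for (key, val) in updates {
        match val {
            Some(v) => { snapshot.insert(key, v); } // upsert: the second one is a no-op
            None => { snapshot.remove(key); }       // delete really removes the record
        }
    }

    // insert -> insert -> delete leaves the record gone, as clients expect.
    assert!(!snapshot.contains_key("k1"));
}
```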
Reading this paper:
Frank McSherry, Andrea Lattuada, Malte Schwarzkopf, and Timothy Roscoe. Shared Arrangements: Practical Inter-Query Sharing for Streaming Dataflows. PVLDB, Vol. 13. http://www.vldb.org/pvldb/vol13/p1793-mcsherry.pdf
suggests an interesting solution: shared arrangements for all inputs. This could be a compilation option. This solution could provide several benefits:
In this model a DDlog computation is really a two-stage process: input arrangements followed by the actual dataflow graph. The input arrangements have a rich API on both sides: to the outside to perform transactions, and to the inside to supply deltas and to replay data. The dataflow graph inside has a simple API, where everything is a delta.
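A hypothetical sketch of that two-stage split, with made-up type and trait names, just to show where the two APIs would sit:

```rust
// Hypothetical: none of these types exist in DDlog. The outer stage (the input
// arrangements) exposes a transactional, record-oriented API to clients; the
// inner stage (the dataflow graph) only ever sees weighted deltas.

type Weight = i64;

pub struct Delta<R> {
    pub changes: Vec<(R, Weight)>,
}

/// Outward-facing API of the input arrangements: transactions over records.
pub trait TransactionalInput<R> {
    fn start(&mut self);
    fn insert(&mut self, rec: R);
    fn delete(&mut self, rec: R);
    /// Commit translates the buffered transaction into a delta for the
    /// dataflow; the arrangement can also replay its contents to new readers.
    fn commit(&mut self) -> Delta<R>;
}

/// Inward-facing API of the dataflow graph: everything is a delta.
pub trait DeltaConsumer<R> {
    fn apply(&mut self, delta: Delta<R>);
}
```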
That's pretty much what the upsert stuff that Leon's talking about is.
I thought about this more and I hope I have a design. It's not final, but I hope it clarifies some dimensions. I will write a document about it.
Interacting with DDlog programs
A DDlog program is compiled into a library. The task of sending and receiving data from DDlog programs is left to applications built on top.
Interaction with DDlog programs can be made either through the CLI or through API calls. The CLI operations are in fact interpreted by the CLI tool and converted into API calls. The semantics of these API calls is currently not well specified.
Here are things that should be clearly specified:
- Whether the order of updates issued between `start` and `end` within a transaction matters. The spec indicates that updates are buffered and applied at commit time. Are API calls such as `ddlog_clear_relation` also buffered? (A sketch of one possible buffering semantics follows.)
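For illustration only, here is one possible buffering semantics in Rust, with hypothetical types; it does not claim to describe what DDlog actually does. Every operation, including a relation-wide clear, is buffered in order and applied at commit time:

```rust
use std::collections::HashSet;

// Hypothetical buffered operations on a single input relation.
enum Op {
    Insert(u32),
    Delete(u32),
    Clear,
}

struct Relation {
    data: HashSet<u32>,
    buffer: Vec<Op>,
}

impl Relation {
    /// Apply all buffered operations, in order, at commit time.
    fn commit(&mut self) {
        for op in std::mem::take(&mut self.buffer) {
            match op {
                Op::Insert(x) => { self.data.insert(x); }
                Op::Delete(x) => { self.data.remove(&x); }
                Op::Clear => self.data.clear(),
            }
        }
    }
}

fn main() {
    let mut rel = Relation { data: HashSet::new(), buffer: Vec::new() };
    // A transaction containing an insert, a relation-wide clear, and another insert.
    rel.buffer.push(Op::Insert(1));
    rel.buffer.push(Op::Clear);
    rel.buffer.push(Op::Insert(2));
    rel.commit();
    // Under "everything is buffered in order", only the second insert survives.
    assert_eq!(rel.data, HashSet::from([2]));
}
```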