substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.19k stars 155 forks source link

Add Insert/Update/Delete basic functionality to specification #128

Closed GavinRay97 closed 1 year ago

GavinRay97 commented 2 years ago

This would be really useful for Substrait backends that want to function like a traditional RDBMS rather than query-only/for analytics.

jacques-n commented 2 years ago

This makes sense. I'm not sure what the best representation would be. Do you have something in mind?

GavinRay97 commented 2 years ago

Trying to think about how this fits in with the current specification and Proto files

It sort of falls outside of the current design because Insert/Update/Delete don't tie to the concept of a Relation very well I think?

From a capabilities perspective, I think that just the ability to specify a target table name, and a set of (column -> value) mappings should be good enough?

Something like:

var insertData = [
  ["id", "email", "age"],
  [1, "user1@site.com", 27],
  [2, "user2@site.com", 45],
]

// Allow fully-qualified name, or just table name, etc
Substrait.generateInsertStatement(["mydb", "myschema", "user"], insertData)

I'm not very familiar with common DBMS design patterns or things of this nature so this is a naive idea.

jacques-n commented 2 years ago

I think that inserts can be expressed as writes that are not table creating (feels more like a property of writing than a new operation).

For update and delete, can you look at how Calcite structures those operations logically? That might be a good way to start modeling.

curino commented 2 years ago

I am starting to work on it with @jcamachor. I agree with @jacques-n that it can fit reasonably well. The use case is for example to be able to feed workload analysis / view selection algorithms that need visibility not only on selects but also in the write workload of the system. This could also help a lot for the provenance scenarios that Ashvin Agrawal is looking at (integration with Atlas).

We will have a proper proposal out in the coming days, but we are starting to think of something like:

message PlanRel {
  oneof rel_type {
    // Any relation (used for references and CTEs)
    Rel rel = 1;
    // The root of a relation tree
    RelRoot root = 2;
    // An update/insert/delet
    ModOp mod_op = 3;
  }
}

I.e., add a ModOp option in the Plan top level, and then define ModOp as:

  // An Insert operation
  message Insert {
    // the table to insert into
    NamedTable table = 1;
    // A query (which can also be as set of values),
    // with named fields matching all columns that
    // do not have defaults in the table
    RelRoot query = 2;

    //TODO Add replace semantics

  }

 [...]

  // A modifying operation
  message ModOp {
    oneof operation {
      Insert ins = 1;
      Delete del = 2;
      Update upd = 3;
      //TODO Add Upsert semantics
    }
  }

To be clear this is just a very short brainstorming so far, we plan to write it out precisely and try to think through a bunch of the corner cases for INSERT, DELETE, UPDATE, UPSERT.

@GavinRay97 I think we need to support all the cases like:

INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1
WHERE condition;

Hence we are thinking of re-using the RelRoot with semantics where the named columns are mapped with the corresponding columns in the insert table (allowing for inserts that only specify mandatory column values and use defaults elsewhere).

Again, this is just a start, but something we will look into probably next week.

curino commented 2 years ago

I think we can close/resolve this issue. Others can follow to provide more advanced functionalities.