substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.14k stars 148 forks source link

docs: clarify language around user-defined types #604

Closed vbarua closed 4 months ago

vbarua commented 6 months ago

Context: https://substrait.slack.com/archives/C02D7CTQXHD/p1707328717290129

Inlining for posterity

Victor Barua I'm trying to understand some of the language around User-Defined Types here and how it related to literal value in the protobufs. "The structure field of a type is only intended to inform systems that don’t have built-in support for the type how they can transfer the data type from one point to another without unnecessary serialization/deserialization and without loss of type safety." and "The structure field is optional. If not specified, the type class is considered to be fully opaque. This implies that a systems without built-in support for the type cannot manipulate values in any way, including moving and cloning." Questions I have in no particular order:

  1. Does the struct definition define a particular encoding in a system, or does it just define the wire format?
  2. Why does having a structure field allow a system to transfer data without unnecessary serde? My assumption is that without structure field, a system can serde an UDT all and can only pass them around as opaque blobs.
  3. If the struct is not specified, why can't a system clone and move values? Relatedly, what does cloning and moving mean in a system?

David Sisson We've never enforced the encoding in a system before so I'd believe it's just the wire format. The structure field should obviate the need for serde when being passed to another Substrait consumer but it's up to that consumer to serde it if it doesn't fit their internal format. Not sure why clone and move are off limits -- a Substrait gateway could just pass along the any protobuf along without touching it just fine. An example of a UDT is the geometry extension -- instead of keeping track of a complicated type such as a shape with multiple lines it's just an opaque type with routines for querying more specific type information.

Victor Barua "a Substrait gateway could just pass along the any protobuf along without touching it just fine." That was what I was thinking, hence confusion around the clone and move stuff. Someone at work is working on serializing some of our user-defined internal types and they and I realised we had opposite interpretations of some of the wording here :sweat:

Weston Pace The PR adding these statements simply refers to a slack discussion which is beyond the slack history :cry:

vbarua commented 4 months ago

buf is complaining about

proto/substrait/algebra.proto:895:9:Field "2" on message "UserDefined" moved from outside to inside a oneof.

I ran a couple of quick checks on this.

First, decoding a binary encoded version of the message against the old and new protobufs:

substrait git:(main) echo 8a023612340a2e747970652e676f6f676c65617069732e636f6d2f676f6f676c652e70726f746f6275662e496e74363456616c75651202082a | xxd -r -p | protoc --proto_path=proto --decode substrait.Expression.Literal proto/substrait/algebra.proto
user_defined {
  value {
    type_url: "type.googleapis.com/google.protobuf.Int64Value"
    value: "\010*"
  }
}

substrait git:(vbarua/user-defined-type-thonks) echo 8a023612340a2e747970652e676f6f676c65617069732e636f6d2f676f6f676c652e70726f746f6275662e496e74363456616c75651202082a | xxd -r -p | protoc --proto_path=proto --decode substrait.Expression.Literal proto/substrait/algebra.proto
user_defined {
  value {
    type_url: "type.googleapis.com/google.protobuf.Int64Value"
    value: "\010*"
  }
}

The decoded message is the same across both version. Checking the JSON, both the old and new protobufs generate the same JSON

{
  "userDefined": {
    "typeReference": 0,
    "value": {
      "@type": "type.googleapis.com/google.protobuf.Int64Value",
      "value": "42"
    },
    "typeParameters": []
  },
  "nullable": false,
  "typeVariationReference": 0
}

The rule it is triggering is FIELD_SAME_ONEOF.

This checks that no field moves into or out of a oneof or changes the oneof it's a part of. Doing so is almost always a generated source code breaking change. Technically there are exceptions with regard to wire compatibility, but the rules are complex enough that it's safer to never change a field's presence inside or outside a given oneof.

In this case, following the link shows that we have something similar to one of the exception cases:

Move fields into or out of a oneof: You may lose some of your information (some fields will be cleared) after the message is serialized and parsed. However, you can safely move a single field into a new oneof and may be able to move multiple fields if it is known that only one is ever set.

Effectively the issue that they are guarding against is moving a field into a oneof, and then if that field is set it will clear other messages in the oneof. In this case, as there is only 1 existing field (value) and we are adding a new field (struct), there should be no existing code that tries to set both.

IMO we can ignore this for this PR.