schema: int representation for enums

rvagg commented 5 years ago

I've done some non-trivial surgery here to make this happy. I copied the union representation structure in schema-schema and removed the innards of enum, pushing the "members" into the representations. So in schema-schema we have this:

type RepresentationKind enum {
    | "bool"
    | "string"
    | "bytes"
    | "int"
    | "float"
    | "map"
    | "list"
    | "link"
}

maps to this

        "RepresentationKind": {
            "kind": "enum",
            "representation": {
                "string": {
                    "bool": null,
                    "string": null,
                    "bytes": null,
                    "int": null,
                    "float": null,
                    "map": null,
                    "list": null,
                    "link": null
                }
            }

And as per the docs, I'm proposing that this (from Filecoin):

type Options enum {
    | Blocks 0
    | Messages 1
    | BlocksAndMessages 2
} representation int

(note the unquoted strings this time)

would map to this:

        "Options": {
            "kind": "enum",
            "representation": {
                "int": {
                    "Blocks": 0,
                    "Messages": 1,
                    "BlocksAndMessages": 2
                }
            }

The mapping of {String:Int} isn't as convenient for lookups but I'm sure codegen could make that cost go away.

I also made some cosmetic alterations to representations.md while I was in there.

rvagg commented 5 years ago

Cleaned up a bit, removed quotes from enum values as per discussion above.

Put up a JS implementation, test fixtures are interesting to look at:

https://github.com/ipld/js-ipld-schema/blob/rvagg/enum-int/test/fixtures/bulk/enum-int.yml

and string enum:

https://github.com/ipld/js-ipld-schema/blob/rvagg/enum-int/test/fixtures/bulk/enum.yml

@warpfork this needs your blessing.

warpfork commented 5 years ago

My slightly-alternate proposal would be to stay strong on membership being stored in the type body, and then handling any custom string names or int representations in much the same pattern as how we treat 'rename' directives on struct fields.

E.g.,

type Woo enum {
    | Nope "n"
    | Yep "y"
    | Maybe
}

would result in:


"SimpleEnum": {
    "kind": "enum",
    "members": {
        "Nope": null,
        "Yep": null,
        "Maybe": null
    },
    "representation": {
        "string": {
            "Nope": "n",
            "Yep": "y"
            // note lack of 'maybe' key
        }
    }

(The behavior for ints would be similar, except since there's no sane default, it would be invalid to be lacking a representation map entry for any of the members.)

This is more redundant in the case of many custom names (or, yes, for all 'int' representations). However, there's a choice to weigh here: is avoiding redundancy in the schema AST something we need to significantly optimize for, or do we give priority to the "type semantics in the type block, serial semantics in the representation block" rule?

I tend to think avoiding redundancy in the AST ranks pretty low.

There might be some implications for round-tripping DSL->AST->DSL:

type Whee enum {
    | Foo "Foo" # redundant -- noncanonical?
    | Bar
}

But perhaps the guideline of "act like struct field renames" should again apply. In this case that would mean redundantly stating the same string as would be inferred by default is simply not handled specially (and is thus retained).

warpfork commented 5 years ago

Speaking of the "type semantics in the type block, serial semantics in the representation block" rule --

There's an argument to be made that we might want to put the strings or ints for the representation in parens, for consistency again with how we use parans elsewhere (namely struct fields and the 'rename' feature there) to denote things that are representational, but got syntactically translocated up to the type block in the DSL for human-author convenience.

E.g.:

type Woo enum {
    | Nope (serial "n")
    | Yep (serial "y")
    | Maybe
}

and

type Whee enum {
    | Foo (serial 1)
    | Bar (serial 4)
} representation int

Otoh, that's getting awfully high in character count, and since I can't imagine any other features appearing in the future which would also reside in those parens, it seems net boilerplatey. So, I'm posting this as a thinking-out-loud, but not sure I'm actually advocating for it.

rvagg commented 5 years ago

I'm fine with the JSON proposal except for the discrepancy with the way we handle unions. Can you think up some words re why we're doing the double-up in enums but unions get everything shifted into their representations?

Re | Foo (serial 1), I'm fine with that and don't think it's an entirely unreasonable ask for schema authors. It's nice and clear what's going on too (I wish I could come up with a better word than "serial" though).

Note that in implicit we are quoting the value, regardless of kind, so maybe we need to follow that here too? | Foo (serial "1"), where representation int is left to do the work of forcing these to integers.

One possibility to simplify this is to just drop the keyword and paren the serialized value: | Foo ("1"), or | Nope ("nopez"). That's a mild departure from previous parsing rules for parens but not a terrible departure and would still allow additional representation rules be appended after a first quoted one.

Let's try that on for size:

type Woo enum {
    | Nope ("n")
    | Yep ("y")
    | Maybe
}

type Whee enum {
    | Foo ("1")
    | Bar ("4")
} representation int

warpfork commented 5 years ago

discrepancy with unions: shoot. You're right. That is discrepant, and we should fix that one way or another.

I seem to have originally written unions this way as well, but then dropped it way back in https://github.com/ipld/go-ipld-prime/commit/763174ea9f27140eaf7d9a7becd85b155124944f with... not as much reasoning in the commit message as I usually expect from myself. The parent commit https://github.com/ipld/go-ipld-prime/commit/4bd496a92c54e3891aa371c1a5278c9d4d2f6e8f is where union guts were introduced for the very first time at all (!), so this is pretty old stuff in general, and maybe worth a review at this point. It looks like map key questions were a part of it, but choosing to have a map at the type level with arguable redundancy means that keys are answered, so, uh, I'm not sure my past self's logic was very consistent here.

So you raise a good point, and I'm not sure which way we should hew and which of these two should change to remove the discrepancy. More thought needed.

parens: cool, let's do it then.

I also think your idea to shorten them down to one value in this case sounds completely fine. :+1: (It probably introduces another small mode in the syntax parser? But it's not ambiguous or anything. Go for it.)

quotes around the value: I'd actually officially like to punt this choice (both here and how implicit does it) to you :)

rvagg commented 5 years ago

discrepancy with unions

Seems like we need to resolve this now then: (1) adopt different approaches for unions and enums, (2) use the current unions approach for both, or (3) use your proposed approach here for both.

My guess is that you switched because of the duplication and verbosity in JSON; combined with the fact that the different types of unions are almost entirely different beasts so their overlapping needs are minimal. Enums are a bit different though, we're not really proposing entirely different things, just that there's a mapping to named programatic things and in one form the serial form is a string, which may or may not be the same as the name, and in another form it's an int.

quotes around the value

Here's one key thing about that question -- if we go for the shorthand version then we make it easier for the parser if we choose to extend the options in parens for enums. If we go unquoted then we have a hard time differentiating. Consider a string representation mapping unquoted: | Foo (f), then we introduce an option like | Foo (f alternate foo) where alternate lets you specify another string form in serialization that would also be a Foo (so a "f" or "foo" = Foo). Now introduce an enum that serializes to "alternate": | Alt (alternate) - the signal to the parser that this is a serialized string is that there's no second token in the parens, what about | Alt (alternate alternate)? That should be valid and could be parsed, but maybe it was a user error? The lack of clarity is a problem here IMO. Which is why I think I'm in favour of quoting, regardless of int or strings: | Foo ("f" alternate foo), | Yes ("1").

Quoting non-strings does irk me but it's a nice signal to the parser, and possibly the author and reader, that we're breaking out of schema-language into serialized forms.

warpfork commented 5 years ago

Still not 100% confident on the best and most consistent way through on the discrepany with unions. You've summed it up well. I'd lean towards saying (3) long run, but we might as well do (1) at present as that's the smaller diff, and if we learn something useful from a brief period running with both strategies, that's great and we can change our minds. But if you have a strong opinion for the other choices, I'd support it.

Complete thumbs up to your comments on the quoting and DSL format. | Yes ("1") reads right to me.

rvagg commented 5 years ago

OK, updated, will merge this if there are no comments in the next day or two.

As per test fixtures in https://github.com/ipld/js-ipld-schema/pull/16 (which is updated according to the latest changes here):

  type SimpleEnum enum {
    | Foo ("0")
    | Bar ("1")
    | Baz ("100")
  } representation int

yields

  {
    "types": {
      "SimpleEnum": {
        "kind": "enum",
        "members": {
          "Foo": null,
          "Bar": null,
          "Baz": null
        },
        "representation": {
          "int": {
            "Foo": 0,
            "Bar": 1,
            "Baz": 100
          }
        }
      }
    }
  }

and

  type SimpleEnum enum {
    | Foo
    | Bar
    | Baz
  }
  type SimpleEnumWithValues enum {
    | Foo ("f")
    | Bar
    | Baz ("b")
  }

yields

  {
    "types": {
      "SimpleEnum": {
        "kind": "enum",
        "members": {
          "Foo": null,
          "Bar": null,
          "Baz": null
        },
        "representation": {
          "string": {}
        }
      },
      "SimpleEnumWithValues": {
        "kind": "enum",
        "members": {
          "Foo": null,
          "Bar": null,
          "Baz": null
        },
        "representation": {
          "string": {
            "Foo": "f",
            "Baz": "b"
          }
        }
      }
    }
  }

rvagg commented 5 years ago

oh, and I'm keeping "representation": { "string": {} } in there for the default, like we do for struct-map.

rvagg commented 5 years ago

merged

ipld / specs

schema: int representation for enums #196