ipld / specs

Content-addressed, authenticated, immutable data structures
Other
592 stars 108 forks source link

Defining Any and its behaviors #318

Closed warpfork closed 2 years ago

warpfork commented 3 years ago

We need to define and document behavior of Any, a named type that can be used in IPLD Schemas to say, well, "anything" goes here. This is both a documentation and a design issue.

As https://github.com/ipld/specs/issues/301 comments, Any is something we can implement in a very non-magical way, simply by specifying it in the prelude as a kinded union:


type Any union {
    | Bool   bool
    | Int    int
    | Float  float
    | String string
    | Bytes  bytes
    | Map    map
    | List   list
    | Link   link
} representation kinded

Doing this as a bog-standard kinded union has very pleasing parsimony.

However, it has some interesting implications. Namely, since it's just a regular kinded union in the IPLD Schema system, it's got the dichotomy of type-level and representation-level semantics. And generally, what users want and expect of this is the representation-level semantics. But they get... more.


So the question is essentially: is this definition of Any actually good?

The general rule of how unions work is: the logical-level always behaves as a map, where the keys are the names of the types. The representation-level, of course, varies wildly (e.g. kinded vs keyed unions behave nothing alike, representationally).

So, if Any is implemented as a kinded union, it does exactly that: the representation-level behaviors are "put anything here", as desired; and the type-level semantics... well, it acts like a map: sometimes the map will have an entry with the key "String", sometimes it'll have an entry instead with the key "Map", etcetera.

There are some POVs from which I don't think it violates the principle of least surprise. But, there are definitely also POVs from which it does. Namely, a typed-Any doesn't really work the same as an untyped-basicnode.Any.

So maybe something not great is going on here with this design. For somegenoutput.Any to behave differently than basicnode.Any is a significant smell.

One option -- Option 0, if you will -- is to roll with this (and that's what the current tip of go-ipld-prime codegen is actually doing. Or, er, well, it's what it recommends doing -- since the prelude is still just a suggestion there, not baked in. Still! Moving on!).


One option for dealing with this is changing the behavior of the type level node for kinded unions. We could declare that for kinded unions, the type-level behavior is actually the same as the representation-level behavior.

I'm a pretty dubious of this: it would break one of the high level design rules of the schema system: that type-level behaviors are defined entirely in terms of the type kind, and have total independence from their representation. (On the other hand, this is admittedly a pretty compelling case.)


Another option is that we make a blessed magic type for Any. We give this magic type the behaviors that we want (essentially, it acts like basicnode.Any: there's no difference between it's type-level and representational-level behaviors, and the representational behavior is identical to the kinded union specified above.) In this case, type Any [...] wouldn't end up in the prelude at all: it would be too magical to describe.

This is quite a bit less parisimonious than just defining Any as a plain, non-magical kinded union. But I really don't have any arguments against it other than that.


The third option is to carry that previous idea a bit further: we could introduce a new kind named any (e.g. it's a conceptual sibling of list|string|map|bytes|etc). We would imbue this kind with the behaviors we want: "put anything here" and "the type level and representation level semantics are the same". The prelude would then also simply contain type Any any (just like it contains type String string).

This seems pretty good. We're not breaking the high-level rule about type/representation independence with this approach, since we just introduce a whole new kind and it just "happens" to have specified the type behaviors and representation (of which there is only one) behaviors as the same.

Probably the weirdest thing about this idea is that ... hmm. No, nothing's weird about this at all. any would be a type-kind, not a representation-kind (e.g., it's in the enum that also contains struct|union|etc), and adding another member to that enum isn't problematic at all. (There was a moment where I thought we'd be stuct introducing a third enum which included this extra weird member, but... no, I think that was sloppy thinking; it's actually simpler than that.)


Okay. I think I talked myself into favoring Option 3 here in the course of writing this issue :) Anyone else have any thoughts on this?

Thanks to @mvdan and @willscott for both independently poking me about something feeling a bit weird here, and thus encouraging this further thought.

mvdan commented 3 years ago

Option 3 sounds good to me! It would remove a footgun and make my ADL code a bit nicer.

rvagg commented 3 years ago

I don't mind rolling with special casing for Any but it might be worth dwelling on the source of the problem - which I don't think is in Any itself, but in union semantics in general. They're always going to have this weirdness about them and Any is just the most general case. It's the primary part of the model where the beauty of the abstractions break down a little and the light from underneath shows through the cracks. For example, you could say this about any kinded union, really:

So, if Any is implemented as a kinded union, it does exactly that: the representation-level behaviors are "put anything here", as desired; and the type-level semantics... well, it acts like a map: sometimes the map will have an entry with the key "String", sometimes it'll have an entry instead with the key "Map", etcetera.

tbh I think this has a fair bit to do with Go, and also the particular model you're pursuing with ipld-prime, not that I think it's flawed or can easily describe a model that might be more pleasant other than hand-waving about traditional OO interfaces and inheritance models rather than this weird Go thing you're boxed inside of. So that's not a criticism, it's just a constraint.

Users of Any might have certain expectations they bring to the party, if it's transparently a kinded union and they get this kinded API, it might feel weird to interact with. The only reason we think users aren't going to have quite the same weird feeling about other kinded unions is that they'll be explicitly called out in the schema they're working with, and not an implicit.

mvdan commented 3 years ago

How do kinded unions behave in other languages like JS? I assume they don't have this two-layer model like in Go, where we have the not-very-intuitive map at first, and the actual representation underneath?

I'm sure @warpfork has thought about this before, but I'm wondering if there would be nicer ways to offer both the logical level (map) and representation level (any) in Go without the layering.

rvagg commented 3 years ago

How do kinded unions behave in other languages like JS

We don't have the nice architecture being built out in ipld-prime, so a kinded union is just a property, of any kind, since, currently, we only operate on plain JavaScript objects. Which is fine as far as it goes but it doesn't get us the nicities of the data model, transparent traversal, drop-in ADLs, etc. I'm still trying to figure out how we get there in JS, but until then we have plain objects.

So ipld-prime is the forefront of how to do this. I'm just imagining a language like Rust coming along afterward, learning all of the lessons of ipld-prime, but using its additional language tools to come up with interfaces that leak much less than kinded unions are currently doing in ipld-prime. But that's really more of a hunch than something I can see. We're all still just relying on @warpfork's brain on the concrete bits here still!

warpfork commented 3 years ago

Yeah fwiw I don't think Golang has anything to do with this :) I'd write this exactly the same way in Java for example (it'd just be a lot slower because it would have more pointers in it...)

The logic that I'm running on here is:

... it just sort of keeps stacking like that :) If the type-level view fits the exact same interface as the representation level view, things just flow.

So when we encounter a story where the type-level view of the data doesn't "do what [somebody] means", it's interesting and maybe we want to try to make it do more of "what [sombody] means". (E.g., if we can make that "debug format for free" thing work "better" in some sense, that's a nice bonus.) But of course there's a limit to this: at most, when talking about "do what [somebody] means" goals, the best we can do is satisficing, because there are so many different "somebody"s with their various concept of "what I mean".

Primarily, my goal is thus that those morphisms be defined at all, so we can expect them to work on any and all data. (E.g. If the "debug format for free" thing doesn't do perfectly what somebody means, well, at least it worked: that provides some value.)

warpfork commented 3 years ago

Anyway tl;dr the way union type-level semantics has nothing to do with golang, and everything to do with I need to make them act like a map so that {the union members can be distinguishable from each other} and yet {all unions be uniform in approach when reduced to the lens of the datamodel}.

warpfork commented 3 years ago

Just as a short keepalive message: it looks like Option3 -- we define an any type-kind -- is the right way to move forward with this. It's got some thumbs-up, we haven't come up with any more blockers or contradictions it would produce, and it generally seems like it'll solve the problem.

Next step: I should figure out where this belongs in the documents and write it up.

mvdan commented 3 years ago

Happy to review PRs :)

rvagg commented 3 years ago

I'm still not comfortable with that option.

When writing a validator, the pattern I used to implement Any was a nested kinded union: https://github.com/rvagg/js-ipld-schema-validator/blob/3f8e1636b28c393f3637bbef0b3279d396536cd7/ipld-schema-validator.js#L36-L59

Which would be represented as:

type AnyMap {String:Any}

type AnyList [Any]

type Any union {
  | Bool bool
  | String string
  | Bytes bytes
  | Int int
  | Float float
  | Null null
  | Link link
  | AnyMap map
  | AnyList list
} representation kinded

This works super nicely for this purpose. It gets loaded as a set of implicits (also bring in the type names AnyList and AnyMap into the global space). And validation proceeds by drilling down into the final terminals without any additional special code to handle it.

The reason I find this illustrative for this discussion is because the concept of Any necessarily implies a recursive definition. It doesn't mean "anything you want", it means "any valid data model kind or recursive set of data model kinds". Which obviously matters for the validation case, and especially matters in the JavaScript case where you can potentially throw strange things in that shouldn't be there (undefined, function perhaps? maybe this could also be the case in Go with a naughty codec that throws out some interesting types).

Making any a magic thing means this all has to become embedded logic, which is not impossible, just a step backward.

I still really don't understand why this has to be a forcing function here:

The general rule of how unions work is: the logical-level always behaves as a map, where the keys are the names of the types. The representation-level, of course, varies wildly (e.g. kinded vs keyed unions behave nothing alike, representationally).

Maybe your concrete conception of a kinded union is wrong? Why do all unions need to behave like a map? Maybe a kinded union is a different kind of beast and you're forcing it into the wrong model. It doesn't even have logical keys, like a keyed union. Is there a way you can rethink the instantiation of a kinded union at the type level that isn't forcing your hand at this level?

warpfork commented 3 years ago

undefined, function perhaps?

I think around here is where we have divergence in what we're worried about what this will suggest to users. To me, sitting here with any compiler and host type system at all -- and with a library design that's oriented around an interface for Data Model Nodes -- this is a non-worry. Everything is still going to implement Node; everything that can be in that value is still going to have to answer Kind() when I ask it; and the answer to Kind() is going to have to be a value from the familiar (and strict) enum of Data Model kinds.

I think maybe you make a good point in regards to how we document this, at least -- it makes sense to have a sentence or two clarifying that Any is not meant to be an escape valve that goes all the way to infinity.

warpfork commented 3 years ago

I don't know if I ever linked these explicitly enough, btw, but an example of where this is important was brought up in an issue in go-ipld-prime by @willscott when he was working on the statediff project: https://github.com/ipld/go-ipld-prime/issues/92

Right now you can do things with the basicnode.Any implementation that you can't do with the Any definition we've made in schemas using kinded unions, and one way or another, that's definitely a design flaw that needs to get hammered out.

hannahhoward commented 3 years ago

One thing to consider about Any --

If a kinded union can involve user defined types:

type Foo struct {
   Bar Int
}

type Baz struct {
   Buzz String
}

type KU union {
  | Foo
  | Baz
} representation Kinded

Then Any as represented by:

type AnyMap {String:Any}

type AnyList [Any]

type Any union {
  | Bool bool
  | String string
  | Bytes bytes
  | Int int
  | Float float
  | Null null
  | Link link
  | AnyMap map
  | AnyList list
} representation kinded

is not a true "Any" as it doesn't contain the user defined types above.

This is more than an academic consideration. The current https://github.com/ipld/go-ipld-adl-hamt has an ANY at the bottom of the tree in the underlying representation, but most likely the types at the bottom are well defined on a per HAMT basis. Or, perhaps they are well defined but unknown until the moment of read -- in cbor-gen world there is a concept of Deferred, where we leave part of a IPLD block in a serialized representation and deserialize later when we know the type.

Anyway, much of this probably does not rise to level of things that should actually make it into the schema layer, but I think having an any that is truly "any" makes more sense. That leads me to believe Option 3 is better.

warpfork commented 2 years ago

https://github.com/ipld/ipld/pull/136 , assuming it lands, should finally address this.