ipld / go-ipld-prime

Golang interfaces for the IPLD Data Model, with core Codecs included, IPLD Schemas support, and some handy functional transforms tools.
MIT License
132 stars 50 forks source link

bindnode unions are problematic for use as map keys #234

Open warpfork opened 3 years ago

warpfork commented 3 years ago

tl;dr: Bindnode should have some way to support unions that does not require the use of pointers. Currently this is not the case. This is important for map keys.


It's desirable that when writing their own golang types for use with bindnode, people should be able to avoid use of pointers whenever possible, and in particular this is quite important if they want to use a golang type as a map key.

To be useful as a map key, the comparison operation on a value must be useful. The comparison operation for pointers is... pointer equality. This is very often not useful in a map key. It means it's very hard to create a new value that will be equal to an existing value, even if you have all of the same constituent data. As long as you hold on to literally the the values originally used as keys, you can use them to look things up again; that's the only option you have. (And this can be doubly problematic because it's not super obvious if you make mistakes about this.)

In particular we can illustrate this clearly with a union that we want to use as a map key:

type Foo string
type Bar string
type FooOrBar union {
    | Foo "foo:"
    | Bar "bar:"
} representation stringprefix

type FancyMap {FooOrBar:String}

Right now this will require the user to write a golang type for the union that contains pointers:

type FooOrBar struct {
    Foo *Foo
    Bar *Bar
}

... and, well, you can probably see how that goes. If one parses the string "foo:woo" into a &FooOrBar twice, you'll get two different values: they'll both have the Foo field set, and if you deference that pointer, it'll be the same string... but you'll have different pointers in the middle. And that'll foobar your equality for purposes like a golang map key.

We can't compensate for this by adding our own equality functions in IPLD. It's a golang native map behavior we're talking about here; it can't be patched over from the outside.


I think the solution here will be: We'll have to reintroduce the ability to use an integer field as the in-memory discriminator, and allow non-pointer fields for the members.

I think we can make this optional and just engage it if we find such a field. But we'll have to carefully define exactly what "find such a field" means.

And then it'll still probably also remain a good idea to document some warnings about this, because it'll be in the court of the person authoring the go types to not put themselves in the problematic situation. (The logical errors you'll eventually get from golang itself are not helpful, speaking from experience.)

mvdan commented 3 years ago

I might surprise you with my take here: I don't think we should do this.

I agree that avoiding pointers can be a good thing. Be it for memory locality, or to avoid overhead, or to make equality follow the data.

But I disagree that we need to support using Go types as map keys in bindnode. Two of the most common data model kinds, Bytes and List, will simply not work as map keys since they are slices. Trying to work around that will be very hacky and painful to maintain.

I have a similar opinion on unions. We could support multiple ways to represent them in Go, but I ultimately do not think it's a good idea.

you'll get two different values

As an aside thought, we could probably support some form of opt-in "memoization" or "reuse Go values", akin to json's https://github.com/golang/go/issues/32779. Since IPLD nodes should be immutable when in memory as Go values, it wouldn't even be painful to use. We could even consider turning this on by default if we could figure out some form of sane global defaults.

Beyond the memoization idea above, here are two other suggestions:

1) "Dereference" the union, using its target value as the map key. This will work pretty easily for simple cases where you have a top-level union, I think.

2) Use an encoded form of the node as a map key, like a dag-cbor-encoded string. This should work beautifully for any complex or unknown types; it will work for lists, bytes, unions, etc.

warpfork commented 3 years ago

re-avowing motivation

This is a hard requirement to me. I use union types in map keys all the time when specifying new protocols. So I really need it to work.

And the current status quo of "you must reuse the keys you got from iterating" is pretty nonviable. Sometimes the application logic doesn't permit that as an option at all. Even when it does, the UX is nasty: I spent a long time thinking our bindnode implementation was completely broken when I first tried to do use union types in a map key... until I realized that it was just doing a semantically wrong thing in golang when I created new values and expected to be able to look them up. I think we'd be guiding pained users through that quite often if we don't support a better way to work with this.


don't worry, it's not a portal to infinite madness

Mind: this isn't saying that anything and everything needs to be valid as a map key; there's still limits. From the IPLD Schema side: the requirement is: to be used as a map key, a type must have a representation kind of string. (So: a structs with stringjoin repr is fair game; a union with stringprefix repr is fair game; but lists are not, and bytes are not, and structs with map repr or tuple repr are not.)

"Coincidentally", those requirements mostly line up very nicely with what golang map keys can be. (Golang is mostly more permissive: e.g. any struct is fair game in golang; the range of structs IPLD Schemas will allow as map keys is much narrower.) Right now, the one exception to this is the unions, because of this issue with pointers.


Okay, you have several ideas, sweet :bow: . I will number them and make notes:

approach 1: support unions with a discriminant int field and no pointers

(this is my original proposal, just numbering it and re-listing it for all-in-one reading.)

approach 2: memoization?

Memoization can be an interesting approach but I'm tepid, bordering on cold, on that working well here. I remember building an application in java once that would only work correctly if you remembered to use String.intern before using things as map keys; it was perceived as a very buggy API and I regretted the approach. (The fix of "wrap all access in forced memoization" is also not really available to us in golang in this context!) The concept of memoization also has nontrivial interactions with memory retention (as the linked golang issue also immediately mentions), and getting a performance concern entangled with a logical correctness concern is usually a bad idea.

approach 3: use dereferenced inhabitant as key?

Using the union's actual inhabitant value as the map key. Okay, that's a cool idea.

Except:

approach: use the encoded form as key?

This I think is totally an option, yeah.

The end user could just have a golang string as their map keys? It's a bit, um. Hm. The ergonomics of this seem challenging, on further thought.

We might need to add more functions to bindnode to make it easy enough to work on those values, then. It would be representation-oriented operations rather than type-level-oriented, and all the bindnode API right now assumes it's binding to something type-level.

We'd also lose any ability to make sure values being put in that map are actually correct. E.g. with this approach, one could have a type FancyMap {FancyKey:String}; type FancyKey { a String; b String } representation stringjoin(":") and just slam the string "whoopsnojoinerhere" into the golang type type FancyMap[string]string.

Worth considering. But my initial thought is it might be more work and yet have less guard rails than Approach 1.


tl;dr I'm open to other ideas here, but I still think supporting a shape of golang structures for unions that don't use pointers is probably the clearest path to victory.

warpfork commented 3 years ago

Okay not to nag but I want to log that I just hit this again while being a user and wanting to poke around inside a value in my application to debug it. I can't quickly type out a value for the key -> ow, debugging something beyond that value is agony.

My empirically observed halflife between pain events on this is... low.

mvdan commented 3 years ago

I mulled over this last night, and I agree with your conclusion. My points boiled down to:

1) not all schema types will be usable as Go map keys 2) we don't want bindnode to have too many ways to represent schemas as Go types 3) we should evaluate existing alternatives first

We agree on 1 and 2, I think. On 3 I think it's more subjective. I think using a codec with map[string]V is the only true generic solution, even if you lose some type safety. Your proposed change does extend the ability to use map[K]V with union types, so I think that makes sense.

As for the implementation: it should be as simple as recovering the code we had before, and adding a branch in every one of those places. Something like:

If the first field is called Index, and its type is exactly int, use the integer to store state and require the rest of the fields to be non-pointers representing the elements. Otherwise, expect all fields to be pointers representing the elements.

One if condition at each of those places should be basically free, because either mechanism always has to at least look at the first field already.

warpfork commented 3 years ago

Yep, that analysis seems correct.

... huh. I had been about to say "maybe it's time to do more up-front validation on the golang type shapes, and memoize it", but I guess I don't have a performance argument here to support that position. Neat.

warpfork commented 3 years ago

I thing I considered, and want to log, but seems to be a nonissue:

I felt a little tetchy at first about a rule like "first field is called Index and its type is exactly int", because I'm afraid we're mixing data plane and control plane. E.g., one would be in trouble if one wanted to have a union that legitimately has a first member that's called "Index" (unlikely though that collision may be).

But I guess that's technically not a problem, because the field names in golang don't matter at all in how we've implemented unions in bindnode (just their offsets in the struct do). So, okay!

mvdan commented 3 years ago

I thought about that collision too, but it's actually impossible if we require the non-indexed version to use pointers. Because then, the first valid field would have to be Index *int, not Index int.

And yeah, we're not doing any upfront preparation and memoization of reflect work right now. We probably will end up doing some of it later on, as part of optimizations, kinda like what json does for its marshalers. For now, the performance seems to be good enough, and this particular case doesn't make it any worse :)

mvdan commented 3 years ago

Here's another reason to do this: sometimes the pointers add unnecessary overhead of their own. For example:

type AnyScalar struct {
    Bool   *bool
    String *string
    Bytes  *[]uint8
    Int    *int
    Float  *float64
}

Using pointers to basic or tiny Go types is most likely a bad idea. An Index integer plus non-pointer values would use basically the same amount of memory, and avoid jumping around memory with pointers.