ipld / specs

Content-addressed, authenticated, immutable data structures

Questions about ipld schema #140

Closed rklaehn closed 5 years ago

rklaehn commented 5 years ago

I'm not sure what the best way to communicate is, so since the Protocol Labs MO seems to be to do everything on GitHub, here is an issue.

Schema kinds

Why are these called kinds? In the context of type theory, kinds have a specific meaning, and calling these kinds seems confusing, at least to me.

Here is the association I have when hearing about kinds in the context of a type system. https://en.wikipedia.org/wiki/Kind_(type_theory)

Schema representations

## Fizzlebop is a pair of fields which serializes as "value-of-a:value-of-b" as a string.
type Fizzlebop struct {
    a String
    b String
} representation stringjoin {
    join ":"
}

Unless stringjoin prescribes some kind of escaping scheme, the type of a and b is not String but the subtype of all strings that do not contain the ':' character, otherwise the representation is not unique and thus not reversible. So basically saying that a can be any string is a lie.
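
To make the ambiguity concrete, here is a minimal Python sketch. It assumes stringjoin does plain concatenation with no escaping; the `Fizzlebop` dataclass and the `encode` helper are hypothetical stand-ins, not part of any IPLD library.

```python
from dataclasses import dataclass

JOIN = ":"  # delimiter taken from the schema's `join ":"` line


@dataclass(frozen=True)
class Fizzlebop:
    a: str
    b: str


def encode(v: Fizzlebop) -> str:
    # Naive stringjoin (assumption): concatenate the fields, no escaping.
    return v.a + JOIN + v.b


x = Fizzlebop(a="p:q", b="r")
y = Fizzlebop(a="p", b="q:r")

# Two distinct values collapse to the same serialized string,
# so no decoder can recover both of them.
collision = x != y and encode(x) == encode(y)
```

Both values serialize to the same string, which is why the representation is not reversible over all of String.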

rklaehn commented 5 years ago

@warpfork can you enlighten me about the rationale behind these decisions?

warpfork commented 5 years ago

kinds

The kinds enumeration at the Data Model layer is found here: https://github.com/ipld/specs/blob/master/data-model-layer/data-model.md#kinds

The kinds enumeration in the Schema layer is similar but adds a few more (e.g. struct, enum, etc).

I'd say the meaning isn't wildly different from the type theory one, although we don't really intend to use the word with any kind of "rank-N"/"higher-order" systems. The way we use the word here is also a lot like the way Go uses the word in its type system: https://golang.org/pkg/reflect/#Kind
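
As a rough illustration of "kind" in this reflect-like sense, the sketch below classifies plain Python values by Data Model kind. The `Kind` enum and `kind_of` function are hypothetical names for illustration only; the mapping from Python types to kinds is an assumption, and `link` has no native Python value, so it is listed but never returned.

```python
from enum import Enum


class Kind(Enum):
    NULL = "null"
    BOOL = "bool"
    INT = "int"
    FLOAT = "float"
    STRING = "string"
    BYTES = "bytes"
    LIST = "list"
    MAP = "map"
    LINK = "link"  # part of the Data Model, but not produced by kind_of below


def kind_of(value) -> Kind:
    """Return the Data Model kind of a plain Python value (illustrative mapping)."""
    if value is None:
        return Kind.NULL
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return Kind.BOOL
    if isinstance(value, int):
        return Kind.INT
    if isinstance(value, float):
        return Kind.FLOAT
    if isinstance(value, str):
        return Kind.STRING
    if isinstance(value, (bytes, bytearray)):
        return Kind.BYTES
    if isinstance(value, (list, tuple)):
        return Kind.LIST
    if isinstance(value, dict):
        return Kind.MAP
    raise TypeError(f"no Data Model kind for {type(value).__name__}")
```

The point is only that a kind here is a coarse category a value falls into, much like `reflect.Kind`, rather than a type of types.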

representations and escaping

tl;dr: Yes, you're right -- the implication of the combination of these features is not entirely pleasing. As currently planned, the struct stringjoin representation simply won't be valid for all ranges of values.

How did we get here?

So, there's indeed a series of "between a rock and a hard place" issues with this. The approach of rejecting values which would reach unhappy paths does make the statement "The type definition is no longer that useful since you need to know the representation to know the limits" unfortunately true -- and that definitely conflicts with some of our fundamental design goals. It also seems to be the only viable approach to providing the feature. So here we are.
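
The "reject values that would reach unhappy paths" approach might look roughly like the following Python sketch. The function names `encode_stringjoin` and `decode_stringjoin` are hypothetical, and the exact rejection rule is an assumption, since the feature is only planned.

```python
JOIN = ":"


def encode_stringjoin(fields, join=JOIN):
    """Join field values, refusing any value that contains the delimiter."""
    for value in fields:
        if join in value:
            # Rejecting here keeps every string we do emit reversible.
            raise ValueError(f"field value {value!r} contains delimiter {join!r}")
    return join.join(fields)


def decode_stringjoin(text, n_fields, join=JOIN):
    """Split on the delimiter and require exactly n_fields parts."""
    parts = text.split(join)
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(parts)}")
    return parts
```

With this rule, every value that encodes successfully decodes back to itself, at the cost that the set of valid values depends on the representation.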

The only consolations I can offer are: one, this only comes up with things that were "countable infinity" cardinality already (namely, strings), which seems slightly less bad; two, it's very much an opt-in feature that can be avoided as much as you like (and we can and should put appropriate caveats around it).

For consideration: the most common example I've come across for using this kind of stringjoin struct is in the expression of some sort of "plugin" system, where there's a prefix indicating which of the "plugins" to use, and the remainder is an "argument" to that plugin. In this sort of use-case, the first segment is effectively an enum anyway, and thus this otherwise valid concern becomes moot in practice.
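
A sketch of that plugin-style use-case in Python follows. The plugin names, the function names, and the choice to split only at the first delimiter are all assumptions made for illustration; none of them come from the spec.

```python
JOIN = ":"

# Hypothetical closed set of plugin names; none of them contains the delimiter.
PLUGINS = {"file", "http", "mem"}


def encode_plugin_ref(plugin, argument):
    if plugin not in PLUGINS:
        raise ValueError(f"unknown plugin {plugin!r}")
    # The argument may contain the delimiter; only the prefix is constrained.
    return plugin + JOIN + argument


def decode_plugin_ref(text):
    # Split at the FIRST delimiter only (assumption), so the argument
    # keeps any further delimiters it contains.
    plugin, sep, argument = text.partition(JOIN)
    if not sep:
        raise ValueError("missing delimiter")
    if plugin not in PLUGINS:
        raise ValueError(f"unknown plugin {plugin!r}")
    return plugin, argument
```

Because the first segment comes from a fixed set of delimiter-free names, the round trip is unambiguous even when the argument contains ":".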

We definitely need more documentation and warnings around these limitations. (The current state of this feature is "planned" more so than any degree of "shipped", so the docs are currently very much an outline -- sorry :) More warnings == yes please!) If there are things we can do to improve the overall odds of using the feature well and leading users to avoid using it badly -- for example, having schema "compile time" validators that detect and error on invalid compositions that are certain to trigger irreversible concatenations -- I definitely want to explore them as well.
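
One possible shape for such a "compile time" check, sketched in Python. The schema is modelled as a plain mapping from field name to type name, `check_stringjoin` is a hypothetical name, and the rule itself (flag a stringjoin struct with more than one unconstrained String field) is an assumption about what a validator could reasonably detect, not a specified behaviour.

```python
def check_stringjoin(field_types):
    """Return a list of error messages for a stringjoin struct definition.

    field_types maps field name -> type name, e.g. {"a": "String", "b": "String"}.
    """
    # Fields typed as plain String can hold the delimiter (illustrative rule).
    unconstrained = [name for name, type_name in field_types.items()
                     if type_name == "String"]
    errors = []
    if len(unconstrained) > 1:
        errors.append(
            "stringjoin struct has several unconstrained String fields "
            + ", ".join(unconstrained)
            + "; a delimiter inside any of them makes decoding ambiguous"
        )
    return errors
```

Under this rule, the Fizzlebop example from the top of the issue would be flagged, while a struct pairing an enum-like prefix with a single String argument would pass.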

warpfork commented 5 years ago

Perhaps also worth mentioning on related topics: there are already some other examples of limitations in composability of some of the features in schemas. Some of these simply emerge from the combinations of features we want.

For example, "kinded unions" have all sorts of fun limitations, and these are based on the representation strategy. A kinded union can contain two different struct types, so long as one is, for example, a map in representation and the other is a list (via the 'tuple' representation strategy); we could add a third via one of the representation strategies that becomes a string. This is an interesting example because the validity of the composition depends on the representations, even though the cardinality logic we can use on the composition can still disregard the representation.
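
A minimal Python sketch of the kinded-union constraint: each member must have a distinct representation kind. The data layout and the name `check_kinded_union` are hypothetical.

```python
from collections import Counter


def check_kinded_union(members):
    """Return the representation kinds claimed by more than one member.

    members maps member type name -> representation kind,
    e.g. {"Foo": "map", "Bar": "list"}. An empty result means the union is valid.
    """
    counts = Counter(members.values())
    return sorted(kind for kind, n in counts.items() if n > 1)
```

For instance, three structs represented as map, list (tuple), and string respectively pass the check, while two structs that both use the default map representation do not.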

Similarly, "inline unions" have several limitations: they only work when all member types have representation kind map; and if any of those maps are representations of structs with a field name that's the same as the union discriminant key, then we can statically say this is an invalid composition.
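
The two inline-union rules above can be checked statically along these lines. This is a Python sketch under assumed data structures; `check_inline_union` and its input layout are hypothetical.

```python
def check_inline_union(discriminant_key, members):
    """Return error messages for an inline union definition.

    members maps member type name -> (representation kind, list of field names).
    """
    errors = []
    for name, (rep_kind, field_names) in members.items():
        if rep_kind != "map":
            # Rule 1: every member must have representation kind map.
            errors.append(f"{name}: representation kind is {rep_kind}, not map")
        elif discriminant_key in field_names:
            # Rule 2: no member field may collide with the discriminant key.
            errors.append(f"{name}: field {discriminant_key!r} collides with the discriminant key")
    return errors
```

A union keyed on "type" whose members are map-represented structs without a "type" field yields no errors; a tuple-represented member or a member with its own "type" field is reported.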

So, the limitations around stringjoin and similar representations of structs are not alone in having required consideration of a tug of war between utility and practicality and purity! :)

rklaehn commented 5 years ago

Regarding kinds: I have never written Go, so this did not sound familiar to me. I have written lots of Scala, where kinds are usually used in the context of higher-kinded types like in the linked Wikipedia article. Still not entirely happy with the name, but I guess I can live with it.

Regarding the representations, I guess it would help if there was some kind of description of the goals of the schema layer. Maybe it exists in one of the zillion repos and I have not found it yet...

I guess I don't fully get why you need this extreme representation flexibility. I originally thought that this was just an attempt to add more type information to e.g. dag-cbor, but it seems that you want to have the ability to use the schema layer on top of arbitrary protocols, which then of course leads to the sort of compromises you had to make.

"And yet there are protocols out there which use prefixes such as this that I want to be able to support and work well with;"

rvagg commented 5 years ago

"I guess I don't fully get why you need this extreme representation flexibility. I originally thought that this was just an attempt to add more type information to e.g. dag-cbor, but it seems that you want to have the ability to use the schema layer on top of arbitrary protocols, which then of course leads to the sort of compromises you had to make."

Sort of. We're seeing IPLD as mainly operating above the protocols/codecs, which is why the "data model" exists: it's kind of a lowest common denominator of things we can get the codecs we care about to work with (some of which require a bit of coercion or compromise). If we accept the data model as usable, the next question is how to describe more complex "types" that are built using the individual values in the data model, which is where schemas come in. One reason we've accepted "kinds" as a word to describe the values in the data model is that we hope to be moving to a place where we refer to them less and to "types" more often, these things that are more formally composed at a layer above the data model.

That's the theory anyway. It's all very new and in flux (hence the documentation problems), so now's a great time to influence the thinking as it evolves. As we're pushing into practical (non-IPFS) use-cases of these things, we're hitting edges that are a little uncomfortable, so we come back for discussion and we're pretty open to adjusting course. As an example of this, see https://github.com/ipld/specs/issues/144, where (some of us) initially assumed that we could build some things we want now on top of schemas but found that the fit isn't quite right, so there's a bit of a fork taking place. We'd like to reconcile that and bring them back into alignment, but right now schemas are a new idea and there's very little infrastructure in place to start building on top of them or extending them.

Some activity right now (might be useful to someone browsing this issue, in lieu of more expansive docs, hopefully):