ipld / specs

Content-addressed, authenticated, immutable data structures

Discussion: IPLD Data Model #324

Closed vmx closed 3 years ago

vmx commented 3 years ago

This is triggered by the "Strings" discussion. Please note that the word "String" doesn't even appear in the document (except for the appendix).

My hope is that this view on the IPLD Data Model will give a new perspective on the "what are Strings" discussion.

I decided to push it as a PR, so that we can discuss it here and even have inline comments.

This isn't really meant to be merged. Hopefully the outcome of this discussion will lead to something that then can be merged.

mikeal commented 3 years ago

The language I recommended for the Data Model spec from our last call was:

Strings SHOULD be valid UTF8. Strings MAY contain invalid UTF8
characters and codec implementations should document a strategy for
handling such strings when they are encountered in languages that cannot support
non-UTF8 characters in the native String type.
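To make "document a strategy" concrete, here is a minimal sketch of what a codec might expose. The names `InvalidUtf8Strategy`, `InvalidUtf8` and `decode_string` are hypothetical and not part of any IPLD library; only the `std` calls are real.

```rust
/// Hypothetical strategies a codec could document for invalid UTF-8 input.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InvalidUtf8Strategy {
    /// Refuse to decode the string at all.
    Reject,
    /// Substitute U+FFFD for each invalid sequence (lossy).
    Replace,
}

/// Hypothetical error type: records how many leading bytes were valid UTF-8.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidUtf8 {
    pub valid_up_to: usize,
}

/// Decode bytes into a native `String`, applying the chosen strategy
/// only when the input is not valid UTF-8.
pub fn decode_string(
    bytes: &[u8],
    strategy: InvalidUtf8Strategy,
) -> Result<String, InvalidUtf8> {
    match std::str::from_utf8(bytes) {
        Ok(s) => Ok(s.to_owned()),
        Err(e) => match strategy {
            InvalidUtf8Strategy::Reject => Err(InvalidUtf8 {
                valid_up_to: e.valid_up_to(),
            }),
            InvalidUtf8Strategy::Replace => {
                Ok(String::from_utf8_lossy(bytes).into_owned())
            }
        },
    }
}
```

With `Reject`, the bytes `[0x61, 0xff]` produce an error; with `Replace`, they decode to `"a"` followed by U+FFFD. Which behaviour is right is exactly the choice each codec would have to document.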

For map keys it’s a little trickier; we should probably have a more detailed discussion about how to document that and where it should go. My suggestion was:

  1. Data Model specifies that maps must have a single and consistent key type but it’s up to the codec and the language it’s implemented in to define what that type is.
  2. Codecs should document these choices and call out any language/library differentiation we know about (this list will grow over time as more implementations are added).

prataprc commented 3 years ago

Let us say a map is defined like Map<K>, where K is a parametrised key type and the value is an enumeration of all supported kinds in the data model. It might end up like this:

struct Map<K> where K: Hash {
    ...
}

If we don't have a common denominator on how K can be represented in bytes, can we have a consistent/uniform index() operation for the same map value across different implementations?

Btw, one more question. Other than serialization, the hash digest of the serialized block, schema matching, and indexing operations within the list and map kinds, should IPLD worry about any other operational behaviour of data?
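One way to picture a "common denominator" is to require every key type to name a canonical byte form and to index on those bytes. The sketch below assumes this; the `KeyBytes` trait, the `index_bytes` method and the choice of `BTreeMap` are illustrative, not something the spec or any implementation defines.

```rust
use std::collections::BTreeMap;
use std::marker::PhantomData;

/// Hypothetical trait: a key type states its canonical byte representation.
pub trait KeyBytes {
    fn key_bytes(&self) -> Vec<u8>;
}

impl KeyBytes for String {
    fn key_bytes(&self) -> Vec<u8> {
        self.as_bytes().to_vec()
    }
}

impl KeyBytes for Vec<u8> {
    fn key_bytes(&self) -> Vec<u8> {
        self.clone()
    }
}

/// A map that is typed by `K` at the API level but stores byte keys,
/// so lookups behave the same regardless of the native key type.
pub struct Map<K: KeyBytes, V> {
    entries: BTreeMap<Vec<u8>, V>,
    _key: PhantomData<K>,
}

impl<K: KeyBytes, V> Map<K, V> {
    pub fn new() -> Self {
        Map { entries: BTreeMap::new(), _key: PhantomData }
    }

    /// Insert under the key's byte form; returns the previous value, if any.
    pub fn insert(&mut self, key: &K, value: V) -> Option<V> {
        self.entries.insert(key.key_bytes(), value)
    }

    /// Index by the native key type.
    pub fn index(&self, key: &K) -> Option<&V> {
        self.entries.get(&key.key_bytes())
    }

    /// Index by raw bytes, as another implementation might.
    pub fn index_bytes(&self, raw: &[u8]) -> Option<&V> {
        self.entries.get(raw)
    }
}
```

Under that assumption, a `Map<String, i64>` holding the key `"a"` answers both `index(&"a".to_string())` and `index_bytes(b"a")` with the same value, which is the uniformity the question asks about.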

vmx commented 3 years ago

@prataprc Your idea is more general. What we have in mind is keeping things as simple as possible. And I expect being more specific and less generic to be easier to reason about and implement.

vmx commented 3 years ago

The language I recommended for the Data Model spec from our last call was:

Strings SHOULD be valid UTF8. Strings MAY contain invalid UTF8
characters and codec implementations should document a strategy for
handling such strings when they are encountered in languages that cannot support
non-UTF8 characters in the native String type.

My proposal here explicitly does not talk about memory layout. It also does not talk about what Codecs should do. And I think this is the sweet spot about it.

This means that from a pure Data Model perspective, Text with invalid Unicode will be invalid. A document that talks about Codecs could then describe what to do if you encounter invalid Unicode characters.

I've two more points:

vmx commented 3 years ago

Talking about Codecs: Do we expect that non-Unicode Text round-trips properly in a Codec? I.e. I have a Text containing non-Unicode data, transform it into my Data Model/programming language, and then transform it back. Does it need to just work or could it error?

If it could error then it aligns with what I'm proposing here. Just because non-Unicode Text would be an invalid Data Model Kind doesn't mean that your implementation can't deal with those invalid things in a special way. This is what I'm after. I want to have a simple/happy "common case" (and yes, I really think Unicode is the common case for text) code path that is easy to implement across most programming languages.
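As a rough illustration of "invalid in the Data Model, but handled in a special way by the implementation", the sketch below keeps invalid bytes in a separate variant so they still round-trip, while the strict Data Model view errors. `DecodedText`, `decode_text`, `encode_text` and `into_data_model` are hypothetical names made up for this example.

```rust
/// Hypothetical decoder output: either valid Data Model Text, or the raw
/// bytes that failed UTF-8 validation (an implementation-specific escape hatch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DecodedText {
    Text(String),
    Invalid(Vec<u8>),
}

/// Decode wire bytes; the common case is valid UTF-8.
pub fn decode_text(wire: &[u8]) -> DecodedText {
    match String::from_utf8(wire.to_vec()) {
        Ok(s) => DecodedText::Text(s),
        // `into_bytes` hands back the original, unmodified bytes.
        Err(e) => DecodedText::Invalid(e.into_bytes()),
    }
}

/// Encode back to wire bytes; both variants reproduce their input exactly.
pub fn encode_text(value: &DecodedText) -> Vec<u8> {
    match value {
        DecodedText::Text(s) => s.as_bytes().to_vec(),
        DecodedText::Invalid(raw) => raw.clone(),
    }
}

/// Strict view: only valid Unicode counts as Data Model Text;
/// anything else is returned as an error carrying the raw bytes.
pub fn into_data_model(value: DecodedText) -> Result<String, Vec<u8>> {
    match value {
        DecodedText::Text(s) => Ok(s),
        DecodedText::Invalid(raw) => Err(raw),
    }
}
```

Here the happy path is a plain `String`, and an implementation that does not want the escape hatch can call `into_data_model` right after decoding and surface the error instead.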

vmx commented 3 years ago

1. Data Model specifies that maps must have a single and consistent key type but it’s up to the codec and the language it’s implemented in to define what that type is.

This aligns with what I'm saying here. This means that from a Data Model perspective the key can be arbitrary bytes, hence it is the Byte-Array Kind.

vmx commented 3 years ago

I've added a few questions and answers that came up on IRC by @warpfork.

vmx commented 3 years ago

I'm closing this one; it served well enough as a discussion starter.