vmx closed this 3 years ago
The language I recommended for the Data Model spec from our last call was:
Strings SHOULD be valid UTF8. Strings MAY contain invalid UTF8
characters and codec implementations should document a strategy for
handling such strings when they are encountered in languages that cannot support
non-UTF8 characters in the native String type.
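One possible "documented strategy" can be sketched in Rust, whose native `String` type cannot hold invalid UTF-8. This is only an illustration under assumptions: the `DecodedString` type and the `decode_string`/`encode_string` functions are hypothetical names, not part of any IPLD codec. The idea is to keep the raw bytes whenever they do not fit the native string type, so nothing is lost.

```rust
/// Hypothetical result of decoding a Data Model string (not an IPLD API).
#[derive(Debug, PartialEq)]
enum DecodedString {
    /// The bytes were valid UTF-8 and fit the native `String` type.
    Valid(String),
    /// The bytes were not valid UTF-8; they are kept untouched.
    Raw(Vec<u8>),
}

/// Decode the raw bytes of a string without losing information.
fn decode_string(bytes: Vec<u8>) -> DecodedString {
    match String::from_utf8(bytes) {
        Ok(s) => DecodedString::Valid(s),
        // `FromUtf8Error::into_bytes` hands back the original buffer.
        Err(e) => DecodedString::Raw(e.into_bytes()),
    }
}

/// Re-encode to bytes; the inverse of `decode_string`.
fn encode_string(s: DecodedString) -> Vec<u8> {
    match s {
        DecodedString::Valid(s) => s.into_bytes(),
        DecodedString::Raw(b) => b,
    }
}
```

With this strategy, valid strings take the ordinary `String` path, while invalid ones survive a decode/encode round trip byte for byte.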
For map keys it’s a little trickier, we should probably have a more detailed discussion about how to document that and where it should go. My suggestion was:
Let us say a map is defined like Map<K>, where K is a parametrised key type and the value is an enumeration of all supported kinds in the data model. It might end up like this:
struct Map<K> where K: Hash {
...
}
If we don't have a common denominator on how K can be represented in bytes, can we have a consistent/uniform index() operation for the same map value across different implementations?
Btw, one more question. Other than serialization, hash digests on serialized blocks, schema matching, and indexing operations within list and map kinds, should IPLD worry about any other operational behaviour of data?
@prataprc Your idea is more general. What we have in mind is keeping things as simple as possible. And I expect being more specific and less generic to be easier to reason about and implement.
The language I recommended for the Data Model spec from our last call was:
Strings SHOULD be valid UTF8. Strings MAY contain invalid UTF8 characters and codec implementations should document a strategy for handling such strings when they are encountered in languages that cannot support non-UTF8 characters in the native String type.
My proposal here explicitly does not talk about memory layout. It also does not talk about what Codecs should do. And I think this is the sweet spot about it.
This means that from a pure Data Model perspective Text with invalid Unicode will be invalid. A document that talks about Codecs could then describe what to do if you encounter invalid Unicode characters.
I've two more points:
Text? Why is that desirable? Why should it be supported and not just lead to an error?
Talking about Codecs: Do we expect that non-Unicode Text round-trips properly in a Codec? I.e. I have a Text with non-Unicode characters, transform it into my Data Model/programming language, and then transform it back. Does it need to just work or could it error?
If it could error then it aligns with what I'm proposing here. Just because non-Unicode Text would be an invalid Data Model Kind doesn't mean that your implementation can't deal with those invalid things in a special way. This is what I'm after. I want to have a simple/happy "common case" code path (and yes, I really think Unicode is the common case for text) that is easy to implement across most programming languages.
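The "could error" variant of that common-case path might look like the following minimal Rust sketch. The function names `decode_text` and `encode_text` are hypothetical, chosen only for illustration; the sketch assumes Text is carried as UTF-8 bytes on the wire.

```rust
/// Strict decoding: Text must be valid UTF-8, anything else is an error.
/// (Hypothetical helper, not part of any IPLD library.)
fn decode_text(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Encoding valid Text cannot fail: a `&str` is always valid UTF-8.
fn encode_text(text: &str) -> &[u8] {
    text.as_bytes()
}
```

Valid Text round-trips unchanged, and input with invalid bytes is rejected at decode time. An implementation that wants to handle such input specially can still do so outside this path.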
1. Data Model specifies that maps must have a single and consistent key type but it’s up to the codec and the language it’s implemented in to define what that type is.
This aligns with what I'm saying here. This means that from a Data Model perspective the key can be arbitrary bytes, hence it is the Byte-Array Kind.
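To make the "keys are bytes" view concrete, here is a small Rust sketch. `ByteKeyMap` is a hypothetical type invented for this example, and the choice of `BTreeMap` as backing store is an assumption, not something the Data Model prescribes. String keys are stored as their UTF-8 bytes, so lookup behaves the same whether the caller holds text or raw bytes.

```rust
/// Hypothetical Data Model map whose keys are plain byte arrays.
struct ByteKeyMap<V> {
    entries: std::collections::BTreeMap<Vec<u8>, V>,
}

impl<V> ByteKeyMap<V> {
    fn new() -> Self {
        ByteKeyMap { entries: std::collections::BTreeMap::new() }
    }

    /// Insert under any key that can be viewed as bytes (`&str`, `&[u8]`, ...).
    /// Returns the previous value stored under the same bytes, if any.
    fn insert<K: AsRef<[u8]>>(&mut self, key: K, value: V) -> Option<V> {
        self.entries.insert(key.as_ref().to_vec(), value)
    }

    /// Look up by the key's byte representation.
    fn get<K: AsRef<[u8]>>(&self, key: K) -> Option<&V> {
        self.entries.get(key.as_ref())
    }
}
```

Because every key is reduced to bytes before it touches the map, `"name"` and `b"name"` address the same entry, and keys that are not valid UTF-8 need no special casing.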
I've added a few questions and answers that came up on IRC by @warpfork.
I'm closing this one; it served well enough as a discussion starter.
This is triggered by the "Strings" discussion. Please note that the word "String" doesn't even appear in the document (except for the appendix).
My hope is that this view on the IPLD Data Model will give a new perspective on the "what are Strings" discussion.
I decided to push it as a PR, so that we can discuss it here and even have inline comments.
This isn't really meant to be merged. Hopefully the outcome of this discussion will lead to something that then can be merged.