doc: string honesty - Githubissues

mikeal commented 3 years ago

Ok, here’s the language I think we can all agree is the state of the world as far as strings are concerned.

I hope to put to bed additional discussion of data model guarantees in any particular direction when it comes to strings.

mikeal commented 3 years ago

@warpfork we’ll definitely want a deeper document on strings. I’d like that doc to provide better guidance for implementers that can clearly explain the tradeoffs and also list the existing implementations and how they handle each case. I’ll work on that (pulling from the doc you linked) once this is agreed upon and merged, but I agree with you on the need for something with clearer guidance for implementers.

vmx commented 3 years ago

Based on my comments and @warpfork's comments, I think the point of friction is what each of us considers to lead to greater interoperability, and also what we consider valid/invalid. I'll try to summarize; @warpfork, please correct me if I got your points wrong:

@warpfork: Libraries that don't validate every piece of data and also allow invalid data are more versatile. They work even if the producer of the data didn't implement things fully true to some specs. It leads to better systems, as they won't error on subtle issues that might not even be a problem.
@vmx: If data is produced which isn't spec compliant, it shouldn't be expected to work properly with libraries that are implemented according to the spec. Interoperability should be guaranteed for cases where all actors comply with the specs. If you implement your library according to the spec, it is interoperable with everyone else complying with the spec. If not, it's a bug. If a spec is based on existing specs, it should not contradict existing constraints, in order to provide maximum interoperability even on the lowest level.
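
To make the two positions concrete, here is a minimal Python sketch of a decoder written each way. The function names (`decode_strict`, `decode_lenient`, `encode_lenient`) are made up for illustration and are not taken from any IPLD library; the lenient variant relies on Python's built-in `surrogateescape` error handler, which is just one possible way to carry invalid bytes through.

```python
def decode_strict(raw: bytes) -> str:
    """Spec-strict position: reject anything that is not valid UTF-8."""
    # bytes.decode raises UnicodeDecodeError on invalid input by default.
    return raw.decode("utf-8")


def decode_lenient(raw: bytes) -> str:
    """Lenient position: accept arbitrary bytes without erroring.

    Invalid bytes are smuggled through as lone surrogate code points
    (Python's "surrogateescape" handler), so nothing is lost.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lenient(text: str) -> bytes:
    """Inverse of decode_lenient: restores the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")


valid = "héllo".encode("utf-8")
invalid = b"caf\xff"  # 0xFF can never appear in well-formed UTF-8

# Both decoders agree on valid input.
same_on_valid = decode_strict(valid) == decode_lenient(valid)

# Only the lenient decoder accepts the invalid input, and it round-trips.
round_trips = encode_lenient(decode_lenient(invalid)) == invalid

try:
    decode_strict(invalid)
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

The lenient path keeps the data byte-for-byte intact, while the strict path surfaces the problem at the boundary; which of those counts as "more interoperable" is exactly the disagreement summarized above.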

prataprc commented 3 years ago

Applications that only serialize valid UTF8 in string values will be more interoperable than applications that do not.
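
As a rough illustration of what "only serialize valid UTF8" could mean on the producer side, here is a small Python sketch. The helper names (`is_valid_utf8_text`, `serialize_string`) are hypothetical and not part of any codec; the check simply leans on the fact that Python refuses to encode lone surrogates to UTF-8.

```python
def is_valid_utf8_text(text: str) -> bool:
    """Return True if the string can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        # Raised, for example, when the string contains a lone surrogate.
        return False
    return True


def serialize_string(text: str) -> bytes:
    """Hypothetical producer-side guard: refuse to emit invalid strings."""
    if not is_valid_utf8_text(text):
        raise ValueError("string is not encodable as valid UTF-8")
    return text.encode("utf-8")


ok_bytes = serialize_string("naïve")
lone_surrogate_ok = is_valid_utf8_text("\ud800")
```

A producer that applies a guard like this never puts ill-formed strings on the wire, so consumers do not need to agree on how to treat them.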

Interoperability with the past and present is one thing, but interoperability with the future is a different beast altogether.

The short version is, I've looked into all the language-independent codecs I could find, and all of them specify what strings should look like (JSON, CBOR, MessagePack, ProtocolBuffers, Cap'n'proto, CUE, Apache Thrift, Apache Avro, Flatbuffers, BARE, BSON, Borsh, Amazon Ion).

Maybe this is a solid case for not doing it their way ;) When everyone makes the same design choice, everyone wins or everyone loses.

Btw, I am only following the string-encoding discussion that happens on GitHub, so pardon me if I sound too pedagogical.

Not the first time I am seeing strings popping up as a difficult data type. Languages like Rust will simply treat each string encoding as a different type altogether. But I guess IPLD is not a type system, so we don't have that luxury.

Per my understanding, the string data type is going to be as colorful as there are humans and human languages. And honestly, I like to have it that way. We should remember that not all linguistically different populations are equally empowered with computer technology in their native tongue. But it will happen eventually, at which point each one of them could have their own "string-data-type". For now, UTF-8 is the largest repository of linguistic analysis from a programming PoV and has backward compatibility with ASCII.

Personally, one of the reasons I am attracted to IPLD / IPFS is that it wants to be future proof. It is a super difficult problem.

Also, I see "interoperability" being mentioned in two different contexts: one that happens after a handshake between actors, and the other that happens without a handshake.

ipld / specs

doc: string honesty #331