Remove less bugs with string as bytes section

vmx commented 3 years ago

Removing the section about less bugs in regards to handling strings as bytes. This just isn't true. I know of two instances where there were bug reports about strings, both were discovered in Rust implementations (where strings are UTF-8) and were reported against (and fixed) on the Go implementations.

warpfork commented 3 years ago

What bugs and changes were those?

Because we certainly haven't changed any of the Go implementations I'm working on, and I'm (again, same as the last time we talked about this) fairly completely certain we'd have show-stopping regressions if we did causally push a change that narrowed the range of values we can safely work with while calling them strings.

vmx commented 3 years ago

It wasn't in IPLD, but in libp2p, but it's the same problem: https://developer.actyx.com/blog/2020/07/27/libp2p-gossipsub/ The result was to switch the code from using bytes instead of strings. The other issue I'm referring to is some ID field in Filecoin, which was planned to (I don't know if it actually switched) also switch to "bytes" instead of a string as it can contain arbitrary bytes.

Given those cases, I think it's not fair to say that an implementation treating strings as arbitrary bytes will see less bug reports. I also don't say it's the reverse. Hence I think removing this sentence from the spec is a good idea.

warpfork commented 3 years ago

The blog about libp2p has a problem which stems from something trying to be more restrictive than other protocols actually behaved in the wild -- which is the point of the paragraph you're proposing we remove: the reason we don't want to define IPLD strings as anything other than 8-bit sequences of bytes is because implementations in the wild will implement them that way, and it is attempts to be more strict that actually manufacture the troubles... and therefore we don't want to suggest that people manufacture these troubles in the first place.

The thing about filecoin which you refer to is irrelevant, because 100% of people involved agreed that it should not have been labelled a string field at any point.

warpfork commented 3 years ago

This text emerged after several months of litigation not so many months ago. I'm not revisiting it at this time.

vmx commented 3 years ago

I would prefer if the spec wouldn't speculate about things. So far I don't see an indication that the sentence I want to see removed is true.

I agree that in the cases I mention, the issue was caused by the implementations that are stricter. And I think we also agree that there will be such implementations. But the result is, that the bug reports (as mentioned in the cases above) are opened on the repositories that are less strict. So those repositories will see more bug reports. You could surely question whether it is a bug or not (we won't find agreement here), but this sentence is not about bug, but about bug reports.

I re-open this issue as @warpfork might not be revisiting this issue, but others might.

aschmahmann commented 3 years ago

But the result is, that the bug reports (as mentioned in the cases above) are opened on the repositories that are less strict

FWIW I don't think we have sufficient evidence for this. What we do know is that if specs are not clear people will make mistakes due to miscommunications, and if the spec does not match the reality of existing implementations things are event worse.

In the libp2p example the issue IMO was that the protocol spec for pubsub topic names was unclear, but was implied to rely on the protobuf spec for strings (UTF-8) despite early users relying on non-UTF-8 topic names.

In protobufs' case they decided "A string must always contain UTF-8 encoded or 7-bit ASCII text.", but then the Go (and apparently for proto2 many other lanugages as well) codegen didn't want to verify this was true and didn't give any big red flags which led to people using non UTF-8 in violation of the spec and putting the protobuf team in a rough spot. Take a look at some of the issues listed in https://github.com/golang/protobuf/search?q=string+utf8&type=issues if you're curious.

vmx commented 3 years ago

FWIW I don't think we have sufficient evidence for this.

@aschmahmann Do you think there sufficient evidence for the current wording?

warpfork commented 3 years ago

To contextualize why I'm giving this proposed change flak: it's because I think it's trying to snipe at the edges of things you don't like rather than engaging in the substantive roots of why things come out this way. You can try to keep re-arranging deck chairs on this thing you're turning into the Titanic all you want, but I don't understand what you hope to accomplish. This documentation is not the root of any of the issues you're frustrated with.

We've tried to engage substantively on this over multiple months, in many, many, many, many, many, many jousts, both you and me, and you and many others both inside and outside our team, and there are dozens of gists and hackmds littering the road, and dozens of hours spent in meetings, and entire tables of options for ways to try to resolve this in a way that make make you happier.

I don't want to say any of this is perfect, but I'm begging you, yet again, if you really want to change the state of play on this, please bring your attention to where the real lever/fulcrum is, and if that's not been made clear by any of the dozens of previous engagements: the lever and/or fulcrum that needs to move is: improving the definition of how we can have arbitrary binary in keys in maps, and what implementations will actually do, in the wild; it is not this documentation. You can snipe at this couple of sentences if you really want, but I don't see how it advances any other material progress on the subject.

I honestly barely care if we delete these words or not at this point. I don't think it matters. I am, however -- and I am putting this fully on the public record, because it's been true for a while now -- of very limited patience with this skirmishing about it. This couple of sentences is the minimum that people could agree to, after these months of fighting about it. I think you could find almost any other part of our docs and specs to discuss and it would be more valuable.

aschmahmann commented 3 years ago

TLDR: I think this wording is fine.

The current wording is:

Codec implementations that can de-serialization and roundtrip arbitrary byte data in strings will see fewer bug reports from people working with data produced by applications that serialize arbitrary byte data into strings.

This seems almost true by definition. If a codec implementation is able to round-trip non-UTF-8 characters through a string then it'll likely see fewer reports from people who are trying to serialize arbitrary byte data into strings 😁.

I say almost because it's possible someone who writes their code against a codec that roundtrips bytes will come back later to complain "How your code didn't yell at me when I tried to put arbitrary bytes into strings? I would've used the bytes type if I'd known but now it's too late!". This is a fair point and also why we have this line in the spec "Applications SHOULD only encode UTF8 data into string values and use byte values when they need arbitrary bytes".

I'd hope users notice that there are two types, one called string and another called bytes and think about what they should be using. It could be that implementations could benefit from some form of tooling to remind developers who don't look at specs or realize what they're doing. However, being able to handle the looser definition of strings we know people are using in the wild generally seems like a good thing to prevent breakage/interoperability issues.

Side note: I also tend to agree with @warpfork here that one way to try and define our way out of the box and allow ourselves to be stricter with strings would be to look at how we support binary map keys.

If you look at https://github.com/golang/protobuf/issues/922 you can see a user trying to figure out how to deal with older protobuf data in their new code when the code strictness changed out from underneath them. AFAIK the biggest (only?) issue with just doing a find and replace of string -> bytes everywhere application developers put non-utf-8 characters into strings is that map keys that are bytes are not supported.

rvagg commented 3 years ago

I'm fine with the text as it is now because it's true. I was on the verge of adding a decodeStringsAsBytes flag to @ipld/dag-cbor as a crazy hack to get these problematic Filecoin blocks to decode properly. Full round-tripping is another matter entirely and I have no interest (atm) of exploring a special IPLDString type in JavaScript to make this happen, although I've considered various hacks like the mentioned flag that could be used on a per-block basis when you have certainty about the behaviour you want (`encodeStringFieldsAsBytes: ['foo/bar/id']).

vmx commented 3 years ago

Thanks everyone for taking the time to review this again.

ipld / specs

Remove less bugs with string as bytes section #364