swiftlang / swift

The Swift Programming Language
https://swift.org
Apache License 2.0
67.42k stars 10.34k forks source link

String encoding/decoding is a mess #69253

Open dabrahams opened 11 months ago

dabrahams commented 11 months ago

Recently I needed to decode some UTF8 into a String. I know that unicode is sometimes damaged and it's much more important for me to get the text that is readable than to get an error or nil telling me it's ill-formed. I happen to know that's what most people want most of the time, and I know I wrote the library facilities to do it. So I went to look for it in the documentation. Could not find it at all. Instead there are scads of redundant APIs for decoding that say they reject any ill-formed code units, so many it's almost impossible to figure out what I want to use. And there are new String.Encodings as well as Unicode.Codecs, with no deprecation or guidance about which to pay attention to. I had to dig through the standard library sources to find the APIs I knew existed. I don't know what happened since I was working on this stuff, but it's not a very friendly experience and I'm sure if I hadn't written the code I wouldn't have found what I needed. Sorry if this sounds like a complaint, but I think someone needs to put some real attention into the user experience in this area; too many choices is sometimes worse than too few.

stephentyrone commented 11 months ago

This is mentioned under future directions on SE-0405:

There is only one initializer in the standard library for input-repairing initialization, and it suffers from a discoverability issue. We can add a more discoverable version specifically for the UTF-8 encoding, similarly to one of the additions proposed here.

glessard commented 11 months ago

We'd proposed some overloads that include "utf8" in an argument label in SE-0405, but those were rejected for not solving the general problem: "if the tooling were improved to make UTF-8 usage of the general API more discoverable, would we still want the less general API?". The LSG answered no and added "we believe that this is an area that tooling can and should improve." The tooling improvement might be similar to the direction of SE-0299. Unfortunately, this has been an unsolved problem for quite a while. Another tooling problem is the documentation generation, which fails to convey some standard library deprecations.

dabrahams commented 11 months ago

More overloads will just make the problem worse. There are already FAR too many. Overloads have a high usability cost and should be minimized. You can’t fix it with tooling, because users still need to evaluate their many options and suss out any subtle differences. 

AnthonyLatsis commented 11 months ago

So I went to look for it in the documentation. Could not find it at all.

API design is one thing, but documentation is another. Are you referring to one of the overviews at https://developer.apple.com/documentation/swift/swift-standard-library?

dabrahams commented 11 months ago

API design is one thing, but documentation is another.

That point of view is certainly part of the problem. Your documentation describes the semantic part of your API, which can't be captured in the type system.

At some point DevPubs begged me to stop objecting to the way they were doing things (even though I had no authority to stop them from doing anything) because they said it was going to prevent us from shipping, so to be a team player, I did. At that point I had to accept that we would not have the level of coherence I wanted, but I was never under the illusion that the API's code and its documentation were separate from a user's point of view. They are part of one whole experience.

I'm not referring to any specific documentation page. I used the Apple documentation's own search and Gewgle and StackOverflow and found scads of options, many needlessly unsafe APIs, and never what I was looking for. It's not just overloads of course, but different APIs that fall into the same space.

dabrahams commented 11 months ago

@stephentyrone Note also, this premise from SE-405 is wrong, or at least was wrong at one point.

The String type guarantees that it represents well-formed Unicode text.

I distinctly remember when we were working on this we decided it was more important to be able to capture the very large quantity of ill-formed UTF-8 out there in a string without loss so that a whole document wasn't inaccessible just because of a single encoding error. Also this is important to save the cost of validation on construction of large strings. String's own elements were supposed to be unicode replacement characters when that was encountered, while its .utf8 view presented the original bytes.

karwa commented 11 months ago

I agree that this needs more attention, and mentioned the overwhelming number of initialisers during the pitch for SE-0405.

The documentation also doesn't do a good job bringing clarity to users. Here's what I see on the String docs page (and this touches on broader points than just encoding/decoding):

Overview section

Topics Section

Topic 1: "Creating a String"

Sounds promising for this issue, eh? We want to create a String! Except, the first thing I see is this:

image

Hmm... a bit odd. String.init(decoding: FilePath) (from swift-system) is certainly a nice API to have, but it is questionable whether it deserves such a prominent place in String's documentation.

So what other ways do we have to create a String? Here's what immediately follows:

image

Firstly, these do not appear to be logically sorted: notice how there is an init(Character) and init(Substring), which we could think of as "creating a String from another string element", but then there's a bunch of other stuff between them - init<S>(S), init<S>(S), and init<S>(S). A bunch of gobbledegook, basically.

After that (still in the same topic) we have:

image

Didn't take long for us to get to unsafe APIs, did it? Is this really how we want to introduce developers to String?

Topic 2: "Inspecting a String"

image

This is a bizarre little interlude before we get back to talking about String creation. Seriously, this is how it's structured: a bunch of String initialisers, then isEmpty and count in this little section, then a bunch more String initialisers. 🤷‍♂️

Topic 3: "Creating a String from Unicode Data"

image

Okay, so, it starts off with init(Unicode.Scalar) -- which IMO really would be more logically placed adjacent with init(Character) and init(Substring).

Then it puts the Foundation APIs first in the list. This is just wrong. The top initialiser - init(data:encoding:) should only be used if you need a legacy encoding such as Shift-JIS or MacRoman, and all UTF-X data (regardless of whether that data is in a Foundation Data object) you should use init(decoding:as:). That is the standard library's native initialiser which happens to be the very last entry in this list.

I would rate String.init(decoding:as:) as probably the most important String creation API we have. It is very unfortunate that it is not even mentioned in String's overview, only in the third topic section, and not only that, but relegated to the bottom of that section. You could very easily miss it entirely.

All the other stuff in here also needs to be de-emphasised:

I could go on, but I'll leave it there because we've reached String.init(decoding:as:), which is what this issue is about. I think I've made the point that the documentation is not clear, not logically organised, and highlights legacy APIs we'd rather developers didn't use at the expense of modern, native APIs which they should prefer. Many of those legacy APIs should just be deprecated.

I appreciate that cleaning up String's API, and organising its documentation, are not trivial tasks at all. They will need significant effort, but I also think they're very important.

AnthonyLatsis commented 11 months ago

That point of view is certainly part of the problem.

FWIW, that was purely a "we have a situation here" point of view.

dabrahams commented 11 months ago

Sorry to have misread you then, @AnthonyLatsis