String encoding/decoding is a mess

dabrahams commented 11 months ago

Recently I needed to decode some UTF8 into a String. I know that unicode is sometimes damaged and it's much more important for me to get the text that is readable than to get an error or nil telling me it's ill-formed. I happen to know that's what most people want most of the time, and I know I wrote the library facilities to do it. So I went to look for it in the documentation. Could not find it at all. Instead there are scads of redundant APIs for decoding that say they reject any ill-formed code units, so many it's almost impossible to figure out what I want to use. And there are new String.Encodings as well as Unicode.Codecs, with no deprecation or guidance about which to pay attention to. I had to dig through the standard library sources to find the APIs I knew existed. I don't know what happened since I was working on this stuff, but it's not a very friendly experience and I'm sure if I hadn't written the code I wouldn't have found what I needed. Sorry if this sounds like a complaint, but I think someone needs to put some real attention into the user experience in this area; too many choices is sometimes worse than too few.

stephentyrone commented 11 months ago

This is mentioned under future directions on SE-0405:

There is only one initializer in the standard library for input-repairing initialization, and it suffers from a discoverability issue. We can add a more discoverable version specifically for the UTF-8 encoding, similarly to one of the additions proposed here.

glessard commented 11 months ago

We'd proposed some overloads that include "utf8" in an argument label in SE-0405, but those were rejected for not solving the general problem: "if the tooling were improved to make UTF-8 usage of the general API more discoverable, would we still want the less general API?". The LSG answered no and added "we believe that this is an area that tooling can and should improve." The tooling improvement might be similar to the direction of SE-0299. Unfortunately, this has been an unsolved problem for quite a while. Another tooling problem is the documentation generation, which fails to convey some standard library deprecations.

dabrahams commented 11 months ago

More overloads will just make the problem worse. There are already FAR too many. Overloads have a high usability cost and should be minimized. You can’t fix it with tooling, because users still need to evaluate their many options and suss out any subtle differences.

AnthonyLatsis commented 11 months ago

So I went to look for it in the documentation. Could not find it at all.

API design is one thing, but documentation is another. Are you referring to one of the overviews at https://developer.apple.com/documentation/swift/swift-standard-library?

dabrahams commented 11 months ago

API design is one thing, but documentation is another.

That point of view is certainly part of the problem. Your documentation describes the semantic part of your API, which can't be captured in the type system.

At some point DevPubs begged me to stop objecting to the way they were doing things (even though I had no authority to stop them from doing anything) because they said it was going to prevent us from shipping, so to be a team player, I did. At that point I had to accept that we would not have the level of coherence I wanted, but I was never under the illusion that the API's code and its documentation were separate from a user's point of view. They are part of one whole experience.

I'm not referring to any specific documentation page. I used the Apple documentation's own search and Gewgle and StackOverflow and found scads of options, many needlessly unsafe APIs, and never what I was looking for. It's not just overloads of course, but different APIs that fall into the same space.

dabrahams commented 11 months ago

@stephentyrone Note also, this premise from SE-405 is wrong, or at least was wrong at one point.

The String type guarantees that it represents well-formed Unicode text.

I distinctly remember when we were working on this we decided it was more important to be able to capture the very large quantity of ill-formed UTF-8 out there in a string without loss so that a whole document wasn't inaccessible just because of a single encoding error. Also this is important to save the cost of validation on construction of large strings. String's own elements were supposed to be unicode replacement characters when that was encountered, while its .utf8 view presented the original bytes.

karwa commented 11 months ago

I agree that this needs more attention, and mentioned the overwhelming number of initialisers during the pitch for SE-0405.

The documentation also doesn't do a good job bringing clarity to users. Here's what I see on the String docs page (and this touches on broader points than just encoding/decoding):

Overview section

The only things it says about creating a String are:

The String type bridges with the Objective-C class NSString

and

You can create new strings using string literals or string interpolations.

That's all. It doesn't mention anything about these initialisers.
The section about comparing Strings uses the familiar "café" example to try illustrate what canonical equivalence is, but in my experience most people still don't understand what this is actually showing. It would be more useful to say something like:

Two Strings may be canonically equivalent even if they have a different sequence of scalars

And leave it at that.
Each of the String views: the UnicodeScalars view, UTF16 view, and UTF8 view, are given their own subsections. It's good (and important) to mention them, but IMO they have too much presence in this mini-article introducing the String type. We should combine these in to one subsection which mentions these views -- and personally I would only show examples using the UTF8 view, while mentioning that the UTF16 view also exists.

Another idea would be to break that whole discussion off in to a separate article. It's a deep topic which requires explaining some background in order for developers to truly understand it. It may need more depth than we can afford in the "overview" section on such a commonly-used type as String.
The section about String using copy-on-write could also probably be dropped. It should be expected at this point IMHO, and it feels distracting in this document which should be describing the essence of String.

Topics Section

Topic 1: "Creating a String"

Sounds promising for this issue, eh? We want to create a String! Except, the first thing I see is this:

Hmm... a bit odd. String.init(decoding: FilePath) (from swift-system) is certainly a nice API to have, but it is questionable whether it deserves such a prominent place in String's documentation.

So what other ways do we have to create a String? Here's what immediately follows:

Firstly, these do not appear to be logically sorted: notice how there is an init(Character) and init(Substring), which we could think of as "creating a String from another string element", but then there's a bunch of other stuff between them - init<S>(S), init<S>(S), and init<S>(S). A bunch of gobbledegook, basically.

After that (still in the same topic) we have:

Didn't take long for us to get to unsafe APIs, did it? Is this really how we want to introduce developers to String?

Topic 2: "Inspecting a String"

This is a bizarre little interlude before we get back to talking about String creation. Seriously, this is how it's structured: a bunch of String initialisers, then isEmpty and count in this little section, then a bunch more String initialisers. 🤷‍♂️

Topic 3: "Creating a String from Unicode Data"

Okay, so, it starts off with init(Unicode.Scalar) -- which IMO really would be more logically placed adjacent with init(Character) and init(Substring).

Then it puts the Foundation APIs first in the list. This is just wrong. The top initialiser - init(data:encoding:) should only be used if you need a legacy encoding such as Shift-JIS or MacRoman, and all UTF-X data (regardless of whether that data is in a Foundation Data object) you should use init(decoding:as:). That is the standard library's native initialiser which happens to be the very last entry in this list.

I would rate String.init(decoding:as:) as probably the most important String creation API we have. It is very unfortunate that it is not even mentioned in String's overview, only in the third topic section, and not only that, but relegated to the bottom of that section. You could very easily miss it entirely.

All the other stuff in here also needs to be de-emphasised:

init?(validatingUTF8: UnsafePointer<CChar>) is now init?(validatingCString:). It should live in the later section "Converting a C String"
init(utf16CodeUnits:count:) should probably just be outright deprecated. If you have a pointer and count, you can create an UnsafeBufferPointer and send it through String.init(decoding:as:).
init(utf16CodeUnitsNoCopy:count:freeWhenDone:) actually is deprecated already! But that isn't shown on the topics page, and for some reason it still ranks higher than String.init(decoding:as:) 😔

I could go on, but I'll leave it there because we've reached String.init(decoding:as:), which is what this issue is about. I think I've made the point that the documentation is not clear, not logically organised, and highlights legacy APIs we'd rather developers didn't use at the expense of modern, native APIs which they should prefer. Many of those legacy APIs should just be deprecated.

I appreciate that cleaning up String's API, and organising its documentation, are not trivial tasks at all. They will need significant effort, but I also think they're very important.

AnthonyLatsis commented 11 months ago

That point of view is certainly part of the problem.

FWIW, that was purely a "we have a situation here" point of view.

dabrahams commented 11 months ago

Sorry to have misread you then, @AnthonyLatsis

swiftlang / swift