google-research / dex-lang

Research language for array processing in the Haskell/ML family
BSD 3-Clause "New" or "Revised" License

Support unicode strings #302

Open oxinabox opened 3 years ago

oxinabox commented 3 years ago

This seems incorrect

>=> "a🍕b"
['a', 'U', 'b']
>=> "aĺąłb"
['a', 's', 'b']
oxinabox commented 3 years ago

Oh, this happens for simple character literals also

>=> :p '🍕'
'U'
>=> :p 'ĺął'
's'
dougalm commented 3 years ago

Right, our String type is currently just a list of bytes, which only works for ASCII. We should figure out how to support Unicode.

We want to avoid making the same mistake as Haskell, which is notorious for having too many string types: String, Text, Text.Lazy, ByteString, ByteString.Lazy, ShortByteString, ...

The Text/ByteString distinction is fundamental but the rest of Haskell's zoo of string types is just about navigating a performance-expressiveness tradeoff. The default String is a native list of characters, which is a very convenient object to work with. But it's physically represented as a linked list of pointers to characters, which can be slow.

Dex has it easier here. Our native list type is just a length-hidden table, whose physical representation is a contiguous array of unboxed data. So a Dex byte string represented as a List Byte will automatically be stored as a contiguous array of bytes. Great!

But Unicode is harder. The simplest approach would be to treat strings as lists of UTF-32-encoded characters like this:

data Char = MkChar Word32  -- opaque type to represent a single Unicode character
String : Type = List Char  -- ordinary list

But that uses 4X more space than necessary for mostly-ASCII strings. Alternatively, we could use a wrapper around a UTF-8 encoded byte string:

data String = MkString (List Byte)   -- Opaque type to represent a Unicode string

But then we lose the list-of-characters structure. Specifically, we no longer have O(1) random access to characters, which all of Dex's native array-processing idioms are built on.
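
To see why, here's a quick check of the variable width (a Haskell sketch using the text and bytestring packages, since we don't have Dex strings to test with):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- UTF-8 is variable-width: the byte offset of the i-th character
-- depends on every character before it.
main :: IO ()
main = mapM_ (print . B.length . TE.encodeUtf8 . T.singleton) "a\233\127829"
-- prints 1, 2, 4 for 'a', U+00E9, and U+1F355 (the pizza from the original report)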

Or we could have both. Maybe a three-member zoo isn't so bad?

apaszke commented 3 years ago

To be honest, even with UTF-32 encoding you don't get O(1) random access to characters, because a single "character" in the sense we usually mean might actually span multiple Unicode code points (which together form a single grapheme).

Citing this SO answer:

For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis ...)

In short, there is no way to reliably support O(1) indexing for Unicode strings, and IMO we shouldn't even pretend that we can. There are a lot of writeups from the Swift community about Unicode. They've spent a long time thinking about it, and I tend to agree with most of their conclusions (although their approach can have surprising results too, e.g. system updates changing which strings compare equal, due to changes in Unicode).
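
A tiny Haskell example of the same point, since Haskell's Char is exactly one code point: the precomposed and decomposed spellings of "ä" render the same but are different code point sequences, and nothing normalizes them for you:

-- U+00E4 ("ä", precomposed) vs U+0061 U+0308 ('a' plus combining diaeresis):
-- one grapheme either way, but not the same code points.
main :: IO ()
main = do
  print (length "\228")      -- 1 code point
  print (length "a\776")     -- 2 code points, same rendered grapheme
  print ("\228" == "a\776")  -- False: comparison is per code point, no normalization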

IMO a reasonable plan for now would be to rename Char to ASCIIChar, to make it clear that we don't handle Unicode just yet. Then we should carefully plan for all the subtleties of Unicode that we aren't aware of yet. A good temporary next step would be to make a String type under the assumption that all graphemes are one code point wide, but only if we truly do need Unicode support already.

oxinabox commented 3 years ago

JuliaLang went through this a while ago, and I think it ended up in a pretty excellent state with just one string type (vs. the two it had when I first started using Julia). @StefanKarpinski might be able to chime in with some insights and history there. The Julia case seems highly relevant, since it's a similar domain (albeit a totally different style).

It took me a long time to realize that even characters are not stored the way I assumed they were, which I think is an important trick for getting performance out of UTF-8.

StefanKarpinski commented 3 years ago

Hi, 👋🏻. I’m one of the co-creators of @JuliaLang and designed and implemented much of its string support. @oxinabox asked me to chime in here with some notes about that design.

The observation that O(1) character (code point) access is not that useful is very much on point: if you really want to deal with all the possible Unicode-awareness issues, you'll need to do O(n) grapheme iteration of strings anyway, and probably some kind of normalization in the process. Python decided to require O(1) indexing by character (code point) when they redesigned their strings in Python 3: CPython represents each string with a fixed-width encoding where each code point takes 1, 2 or 4 bytes depending on the largest code point value in the string. As a result, Python needs to transcode every string on the way in and on the way out unless it happens to be pure ASCII.

A worse side effect of this design, in my opinion, is that it means that Python cannot handle invalid string data. There's a purist perspective that says it shouldn't be possible to create an invalid string, but in practice invalid strings do happen quite a lot, and if your string type cannot deal with invalid data, then robust programs that need to handle arbitrary data safely cannot use strings, at which point what is the point of a string type? Typical Python programs ignore the possibility of invalid string data and just fail irrecoverably when they encounter it. For example, file systems don't enforce valid file names: Windows will perfectly happily give you an invalid UTF-16 string as a file name, and UNIX will give you an arbitrary sequence of bytes that is meant to be displayed as if it were UTF-8 but might be anything. If your string type can't represent invalid strings, then either you can't use strings for file names or there will be situations where your language can't, say, list the contents of the current directory.

Representing strings as an array of code points is tempting, but it has major problems:

  1. It’s inefficient since it requires transcoding strings on the way in and out, which is ok for small strings, but what if you’re dealing with large textual data? Then transcoding twice becomes a real bottleneck.
  2. It makes it impossible to deal with invalid string data, since invalid string data cannot be converted to code points. This majorly reduces the utility of a string type, since it can only reliably be used in programs where you already know that the string data is valid. In practice, people ignore this and just write programs that crash when they encounter invalid string data.
  3. It’s not that helpful anyway since if you want to deal with a human conception of “characters” you want to work with normalized graphemes rather than code points, so you need to do O(n) string iteration.

Instead, I recommend leaving string data as-is — just a sequence of bytes. What you need to support strings well is to provide APIs for interpreting those bytes in various useful ways. The following are some of those useful ways:

As a sequence of code units. For ASCII, UTF-8, and Latin-1, the code unit is a byte, or UInt8 as we denote it in Julia. For UTF-16 and UCS-2 the code unit is a byte pair, i.e. UInt16. For UTF-32, the code unit is a four-byte sequence, i.e. UInt32. You want to work with code units in order to implement higher level views of strings, like code points, graphemes, etc. Code units are not something you generally want to expose to users, but indices in terms of code units are the only ones that are O(1), so you do generally want to index in terms of code units.

As a sequence of characters. This is perfectly well-defined for valid strings: each character is a code point as specified in Unicode. For invalid strings, it's a bit less clear-cut, but the Unicode standard does specify how to split an invalid string into replacement characters, which implicitly gives a way to define what constitutes an invalid character sequence. The basic approach is this: decode well-formed code points (valid or invalid: some well-formed code points are invalid for other reasons, like surrogates in UTF-8 or overlong encodings); as soon as you encounter a code unit that cannot be a valid part of the current sequence, the prior sequence is one invalid character and the next code unit starts a new character (invalid or not). I can go into more detail if you need, but you can iterate an arbitrary code unit sequence in a well-defined way as a sequence of valid and invalid characters, each of which corresponds to a sequence of code units in the original string. For UTF-8, a valid or invalid character is 1-4 bytes. For UTF-16, a valid character is 1-2 code units and an invalid character is always 1 code unit (the only way for a UTF-16 character to be invalid is if it's an unpaired surrogate). For UTF-32, a valid or invalid character is always a single code unit. In general, an invalid code unit sequence is at most as long as the longest valid code unit sequence.
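
Here is a rough Haskell sketch of that iteration for UTF-8 (structural checks only: it treats any correctly-shaped sequence as valid, ignoring overlong encodings, surrogates, and the U+10FFFF cap):

import Data.Bits ((.&.))
import Data.Word (Word8)

-- A character, valid or not, is just the slice of code units it came from.
data Utf8Char = Valid [Word8] | Invalid [Word8]
  deriving Show

isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Sequence length implied by a lead byte (Nothing = not a lead byte).
leadLength :: Word8 -> Maybe Int
leadLength b
  | b .&. 0x80 == 0x00 = Just 1
  | b .&. 0xE0 == 0xC0 = Just 2
  | b .&. 0xF0 == 0xE0 = Just 3
  | b .&. 0xF8 == 0xF0 = Just 4
  | otherwise          = Nothing

-- Iterate an arbitrary byte sequence as valid and invalid characters:
-- a truncated sequence becomes one invalid character, and the byte that
-- cut it short starts the next character.
utf8Chars :: [Word8] -> [Utf8Char]
utf8Chars [] = []
utf8Chars (b:bs) = case leadLength b of
  Nothing -> Invalid [b] : utf8Chars bs
  Just n  ->
    let conts = takeWhile isContinuation (take (n - 1) bs)
        rest  = drop (length conts) bs
        chr | length conts == n - 1 = Valid   (b : conts)
            | otherwise             = Invalid (b : conts)
    in chr : utf8Chars rest

main :: IO ()
main = print (utf8Chars [0x61, 0xC3, 0xA9, 0xC3, 0x28, 0xF0])
-- [Valid [97],Valid [195,169],Invalid [195],Valid [40],Invalid [240]]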

As a sequence of graphemes. Grapheme iteration is specified by the Unicode standard. We use the utf8proc library (which was abandoned, but we took over maintenance of it, so now it’s maintained by us). utf8proc is a C library for various Unicode UTF-8 operations like grapheme iteration, uppercasing and lowercasing, etc. A grapheme corresponds to a contiguous sequence of characters in the original string, which can in turn be mapped back to a contiguous sequence of code units in the original string. Any invalid character should be its own invalid grapheme; the latest Unicode standard tells you how to split a sequence of valid characters into graphemes.

There are also “grapheme clusters”, so this isn't even the end of the line for ways of iterating strings; and there's normalization, which you may or may not want to do. But I think that's enough for this post.

I believe Rust, Go and Swift generally approach strings along similar lines, with slightly different choices of terminology and API details. I don't know how they handle invalid characters, but Julia's approach of allowing both valid and invalid characters to be represented by their code unit sequences has been highly successful, and makes it easy to write robust string code that can handle invalid data and let the user decide how to handle it.

Julia only ships with UTF-8 strings built in, and we represent characters as UInt32 values. That UInt32 doesn't hold a code point value, but rather the 1-4 string bytes corresponding to a valid or invalid character sequence, taken directly from the original string. Not only does this allow representing both valid and invalid character sequences, it also avoids needing to decode UTF-8 into code points when going from string to character and encode back again when going from character to string. With this representation you can do both equality and inequality comparisons of UTF-8 characters directly, without decoding them, because the ordering of characters is the same as the ordering of their byte sequences (this is a very handy little property of UTF-8 which also means you can sort UTF-8 strings with strcmp and get the same order as if you'd sorted lexicographically by code points).

In practice we've found that hardly anyone needs string encodings other than UTF-8 these days. There are packages that implement support for strings in other encodings, but you definitely don't need to build that into the language.
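
The byte-ordering property mentioned above is easy to check (a small Haskell sketch; encodeUtf8 here just stands in for whatever encoding step produces the stored bytes):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

utf8 :: String -> B.ByteString
utf8 = TE.encodeUtf8 . T.pack

-- Byte-wise comparison of UTF-8 data agrees with code point comparison,
-- so characters and strings can be compared without decoding.
main :: IO ()
main = do
  print (compare "\233" "\127829", compare (utf8 "\233") (utf8 "\127829"))          -- (LT,LT)
  print (compare "ab\233" "ab\127829", compare (utf8 "ab\233") (utf8 "ab\127829"))  -- (LT,LT)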

Julia strings are indexed by code unit index, but what you get back is a character. This is a slightly odd arrangement, since not all indices are valid. If you do something like get an index and then just add or subtract 1 from it, you might get an invalid string index, and if you use that, you may get an error. The correct way to do arithmetic with string indices is to use the nextind and prevind functions. In hindsight, it might have been better to make string indices an opaque type that didn't allow idx + 1 and only allowed nextind(str, idx). The main reason for the design of just using integers as indices is that it's less surprising, and people can write naive string code with integer arithmetic that works fine as long as the string data is pure ASCII, but to be honest it ends up being a bit of a trap.

Depending on available compiler technology and performance requirements, there may be better approaches. One option is to let people index by character and just have indexing be O(n) or do some kind of index caching to amortize the O(n) cost. I believe Raku (the artist formerly known as Perl 6) does this. That seems too clever by half to me, but it’s definitely a very Perlish thing to do. Another approach that we may investigate for Julia 2 is to return a StringIndex type that captures the code unit index along with the string that the index refers to, which would allow writing idx + 1 and translating it to nextind(idx.string, idx.index). Julia’s compiler recently gained the ability to stack allocate immutable structures that refer to heap-allocated structures (like strings), so this could potentially be efficient enough now. There are still funky API questions like: what happens if you get an index from one string and use it to index into a different string? Does it do O(n) character indexing? Or does it just throw an error?
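
A minimal sketch of what such an opaque index might look like (Haskell; StringIndex, nextInd and the field names here are made up for illustration):

import qualified Data.ByteString as B
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Hypothetical opaque index: it remembers which string it belongs to, so the
-- only arithmetic available is "step to the next character", never bare idx + 1.
data StringIndex = StringIndex
  { siString     :: B.ByteString
  , siByteOffset :: Int
  }

isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- nextind-style step: move past one (valid or invalid) character by skipping
-- the continuation bytes that follow.
nextInd :: StringIndex -> StringIndex
nextInd (StringIndex s i) = StringIndex s (skip (i + 1))
  where
    skip j
      | j < B.length s && isContinuation (B.index s j) = skip (j + 1)
      | otherwise                                       = j

main :: IO ()
main = print (siByteOffset (nextInd (nextInd (StringIndex bytes 0))))
  where bytes = B.pack [0x61, 0xC3, 0xA9, 0xF0, 0x9F, 0x8D, 0x95]  -- "a", "é", pizza in UTF-8
-- prints 3: two steps move past 'a' (1 byte) and 'é' (2 bytes)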

It's probably worth talking about Swift's approach briefly. Take this with a grain of salt because I'm not a Swift programmer and I may be quite wrong or out of date about the details here. Swift essentially skipped all the way to the "grapheme" layer in the above breakdown of ways of interacting with strings: when you iterate a Swift string, what you get represents a grapheme (cluster?) and is a union of things: it can be a simple character or an entire sequence of bytes, since grapheme clusters can be nearly arbitrarily big. And I believe that normalization is done on the fly and automatically when doing things like string comparisons. That approach makes sense if you view Swift as only a language for writing user-facing applications where strings are small and it's important to get these fussy Unicode things right consistently; in that context you should favor correctness and not care about performance.

But I think Swift has come to regret that as people have started using the language to write server applications and technical computing applications. After all, if you're trying to implement an HTTP server, you really don't want every string operation to involve grapheme iteration and Unicode normalization; it becomes a performance nightmare. And frankly, most people don't care about normalization or graphemes. There's also a question of whether normalization is always appropriate. Sure, people are surprised when two strings that print the same aren't equal, but is it really right to say that they're equal as strings? People also expect that if two strings are equal then they are represented by the same sequence of bytes, which isn't true if you're automatically normalizing strings. To me the Swift approach takes it too far: if I want normalization, I can do it explicitly. Just because two strings print the same way doesn't mean they're the same. I also want to be able to work with characters, which can be represented with a fixed size, and not be forced to deal with graphemes, which are arbitrary-length substrings. Again, if I really want to make sure that I'm working with the data the way the user perceives it, I'll normalize it and use a grapheme API.

To summarize:

  1. Leave string data as-is, as a sequence of bytes, and provide layered APIs for viewing it as code units, characters, and graphemes.
  2. Let strings carry invalid data, and only force the user to confront invalidity at the points where it actually matters (like asking for a code point).
  3. Index by code units, ideally behind an opaque index type, since code unit indices are the only O(1) ones.
  4. Building in UTF-8 alone is enough in practice; other encodings can live in packages.

I think that about covers it.

dougalm commented 3 years ago

Hi Stefan, thanks for your extremely thorough write-up.

You point out that it's often possible to tolerate invalid Unicode-like data rather than throwing an error immediately. This is interesting because it's a counterpoint to the "parse, don't validate" philosophy. To me the obvious design, putting aside performance considerations, would have been something like this:

data CodePoint =  ...
ByteString    : Type = List Word8
UnicodeString : Type = List CodePoint

decodeUTF8 : ByteString -> Maybe UnicodeString

But your point is that this is too conservative. If your OS gives you an invalid sequence of bytes as a file name, there's no way to store it as a Unicode string like this, which is a shame if you just wanted to print it out. I think you're arguing for a notion of a "character" that includes invalid data, something like Maybe CodePoint (or Either ByteString CodePoint). Then "decoding" is total (won't throw an error) and perhaps even free at runtime if you represent Char as an undecoded UTF-8 byte string:

Char : Type = Maybe CodePoint   -- This is the logical view, but physically it could be a ByteString
LiberalUnicodeString : Type = List Char

decodeUTF8Liberal : ByteString -> LiberalUnicodeString  -- no need for a `Maybe` anymore

We'll have to think about it. Part of the appeal of a static type system is precisely to avoid invalid data. If UNIX file names are truly byte strings rather than Unicode strings, then arguably we should be representing them as byte strings.

The performance issues are another matter. As you say, for best performance, you'd like to represent CodePoint using List Byte instead of Word32. The Dex compiler currently boxes lists, so a List (List Byte) would be an array of pointers, which is no good. But there's an alternative representation we've played with, which is to store a list of lists as a contiguous array of all the elements, with a separate array of integers giving the offsets to each inner list. Viewed as a byte array, the contiguous array of elements would be identical to the UTF-8 encoding, so there wouldn't be any decoding costs. You'd also get O(1) character lookups via the lookup array of offsets.
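
Here is a rough Haskell model of that representation, with made-up names (OffsetString, charAt); in Dex the offsets table would presumably come from the list-of-lists lowering itself:

import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V

-- All character bytes stored flat (identical to the UTF-8 encoding),
-- plus one extra offset at the end so every character has a start and an end.
data OffsetString = OffsetString
  { osBytes   :: B.ByteString
  , osOffsets :: V.Vector Int   -- length = number of characters + 1
  }

-- O(1) lookup of the i-th character, returned as its byte slice.
charAt :: OffsetString -> Int -> B.ByteString
charAt (OffsetString bytes offs) i =
  let start = offs V.! i
      end   = offs V.! (i + 1)
  in B.take (end - start) (B.drop start bytes)

main :: IO ()
main = print (charAt s 2)
  where s = OffsetString (B.pack [0x61, 0xC3, 0xA9, 0xF0, 0x9F, 0x8D, 0x95])
                         (V.fromList [0, 1, 3, 7])
-- prints the 4 bytes of the third character ("\240\159\141\149")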

StefanKarpinski commented 3 years ago

If UNIX file names are truly byte strings rather than Unicode strings, then arguably we should be representing them as byte strings.

Here's the thing: yes, UNIX file names are truly byte strings. So is any string you read from any input source until you validate it. You could just represent it as a vector of bytes. But that's not very convenient. If you read a file name (or a line) and want to do something with it before validating it, you still want it to behave like a string. I don't know if you have a REPL but if you do, you'd want this unvalidated string to print itself like a string in the REPL instead of printing itself like a byte vector — because reading a file name and seeing [82, 69, 65, 68, 77, 69, 46, 116, 120, 116] isn't very helpful, whereas seeing "README.txt" is much more helpful. So what you really want for unvalidated string input is a type for "vectors of bytes that should be interpreted in a UTF-8-like manner but may not be valid UTF-8". And once you have that, why not just make that your string type? What operations can you do with a validated string that you can't do with an unvalidated one? There's a few, but not as many as you'd think. For example, I've described how you can iterate code units, characters and graphemes coherently from an unvalidated string.

The only point where you really need to worry about validity is the point where you ask a question like "what is the code point of this character?" At that point, you would have an operation that returns Maybe CodePoint and you'd have to handle the possibility that the character is invalid and doesn't correspond to a code point. This is really the point: deferring forcing the user to deal with invalid string data when it actually matters makes working with strings much more pleasant and robust. Consider the alternatives (using readline and readdir as examples):

  1. Throw an error if readline or readdir encounter invalid strings. This means that you cannot write programs that can read arbitrary files and directories, and more generally, if you write a server, it will just crash if someone sends it an invalid request.
  2. Make readline and readdir return byte vectors instead of strings. The user is expected to check each returned byte vector for validity and convert it to a string only if it is valid.
  3. Make readline and readdir return unvalidated strings, and defer forcing the user to deal with potential invalidity until they do something with that string data that requires it to be valid.

Option 1 seems like an obvious non-starter (despite Python taking this approach; it does at least provide options to return "byte strings" instead). Option 2 is possible and seems like what you're suggesting, @dougalm. But it's fairly annoying. I can promise that users are going to be annoyed and wonder why this programming language returns vectors of bytes all over the place instead of returning strings like normal languages do.

That leaves option 3. A crucial point regarding this option is that dealing with invalid string data is very much an "if", not a "when": there are surprisingly few string operations that you cannot sensibly do with invalid string data. You can concatenate and search invalid string data just fine. If you concatenate a valid string with an invalid string, you get another invalid string, but it's still the right concatenation. Consider calling readdir(dir) and getting an invalid file name, name. What do you do with that file name? You probably do something like path = "$dir/$name" and then call open(path). Even if name is invalid and path is therefore invalid too, it is still exactly the path that the file system needs to open that file. What about other operations like string splitting? Suppose you split path into individual path components by splitting it on the character /. Is that a problem if path is invalid? Not at all, as long as you consider invalid characters to be unequal to /. The normal string splitting logic works, you get the right path components, and the last one is name: exactly the invalid file name you got from readdir.
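
This is easy to try directly on raw bytes; a small Haskell sketch with a deliberately invalid file name (the byte values are made up for illustration):

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

main :: IO ()
main = do
  -- "RE" followed by a 2-byte lead (0xC3) cut short by '(': not valid UTF-8.
  let name = B.pack [0x52, 0x45, 0xC3, 0x28]
      path = BC.pack "/tmp/" <> name
  -- Splitting on '/' still recovers the components, invalid bytes and all,
  -- because no multi-byte UTF-8 sequence can contain an ASCII byte like 0x2F.
  print (BC.split '/' path)   -- ["","tmp","RE\195("]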

There are, of course, situations where you do need to deal with invalid string data. Suppose you're parsing a string into an integer: you have to iterate characters, check that each one is between '0' and '9' (throwing an error otherwise), and then subtract the code point of '0' from the code point of the character. Here you do need to make sure the character is valid, since if it isn't, it's not sensible to ask whether its code point is between the code points of '0' and '9', let alone subtract the code point of '0' from it. But that situation would have been a parsing error anyway, because a string representing an integer also has to be valid Unicode.

You might also want to ensure that all strings are valid before saving them somewhere, which you can of course do — just test if they are valid or not. If you want to encode validity at the type level, you can also do that by having a valid string type, such that validity is checked when converting from unvalidated strings to the validated string type.

Bottom line: it is normal, not exceptional, to deal with potentially invalid string data coming from external sources; don't make it unnecessarily annoying or difficult. In particular, don't force the user to deal with invalid characters until they actually have to do so. Allow string operations like concatenation, splitting and search to work on invalid strings — just concatenate the underlying data, etc. Only force users to deal with invalid data when they do something that cannot be sensibly defined for invalid data, like computing a code point.

apaszke commented 3 years ago

Wow, thanks a lot for putting in the time to write such extensive comments @StefanKarpinski! I do agree with the case you make for supporting invalid data in strings as a good default. This is crucial if we ever want Dex to interact with other systems (even the file system!). Many of those were written in pre-Unicode times, and so byte strings are the natural unit they work with internally, even though this is not the abstraction that users expect. But it doesn't make sense to be overly formal and bury everyone in a ton of explicit {en|de}coding.

I think that a reasonable plan might be something like the following (we might want to wrap some types in newtypes, but I skipped those for the simplicity of presentation):

-- Byte strings
Byte = Word8
ByteString = List Byte

-- (Usually) Human-readable strings
def StringIndex (s: String) = <an index set that's a compiler builtin;
                               the unit of iteration is a single code point
                               (or a sequence of invalid bytes)>
String = <compiler builtin, a list of bytes that represents a potentially invalid UTF-8 sequence>
Character = Word32 -- a code point, or a cluster of invalid bytes
CodePoint = Word32 -- a valid code point

indexString : (s: String) -> StringIndex s -> Character
codepoint : Character -> Maybe CodePoint

Then, if the Maybes become annoying, we can always add a type such as UTF8String which requires the list of bytes it holds to be composed entirely of valid code points.

We can easily emit good code for `for pos:(StringIndex s)` loops by implementing them as stateful loops, with no special caching. Of course, random access via `str.(intIndex@(StringIndex str))` is still possible and would be O(n), but we could even emit a compiler warning in those cases, to suggest to people that this might not be the best idea.

One interesting issue with those loops is that, unlike all other index sets, they are more difficult to parallelize, because the computation of `size (StringIndex s)` is serial. Not sure what to do about that...

StefanKarpinski commented 3 years ago

Seems like a good plan. I think that as long as the Maybe only shows up at the point where you want to call codepoint on a Character, you'll find that it's not too terribly annoying, so you may not even need a fully validated UTF8String type.

ScottPJones commented 3 years ago

@apaszke I would make a case for keeping three kinds of strings separate: binary strings; text strings that are not validated or whose encoding is unknown (a very frequent case, where people have data in 8-bit character sets such as ANSI Latin-1 or Windows CP1252); and validated Unicode strings (possibly UTF-8 encoded).

I'd recommend looking at the Unicode standard section on conformance: http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404 and also the security issues: http://unicode.org/reports/tr36/#UTF-8_Exploit

Instead of the 3 options Stefan presented above, I'd recommend a 4th, which is to use a union type, so that each string returned from something like readdir or readlines is typed as either "Text" or "UTF8String". That way, you can still deal with potentially invalid data without simply ignoring the issue (dangerous), immediately erroring (annoying), or forcing the user to always do the conversions from byte vectors to strings.