krakjoe / ustring

UnicodeString for PHP7

Codepoints, not characters #18

Open hikari-no-yume opened 9 years ago

hikari-no-yume commented 9 years ago

Since this doesn't operate on grapheme clusters, we should refer to dealing with codepoints, not characters.

In particular, documentation comments need changing, and charAt should be codepointAt.

hikari-no-yume commented 9 years ago

We should probably also note when dealing with codepoints won't do what you expect, e.g. for reverse.
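To illustrate the reverse pitfall mentioned above, here is a minimal sketch in Python rather than with UString itself (Python's `str` is also a sequence of code points). The helper `reverse_keeping_marks` is a hypothetical name, and it only approximates character boundaries by gluing combining marks to the preceding base code point; it is not full Unicode text segmentation.

```python
import unicodedata

# "é" written as a base letter plus U+0301 COMBINING ACUTE ACCENT (two code points).
word = "cafe\u0301"

# Slicing a Python str reverses code points, so the accent is detached from
# its "e" and ends up at the front of the string.
reversed_by_code_point = word[::-1]


def reverse_keeping_marks(text: str) -> str:
    """Reverse text while keeping combining marks attached to their base.

    Approximation only: a code point with a nonzero canonical combining
    class is appended to the previous cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


print(reversed_by_code_point == "\u0301efac")        # True
print(reverse_keeping_marks(word) == "e\u0301fac")   # True
```

A code-point-level reverse is well defined, but as the first result shows, it is rarely what a caller means by "reverse this text".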

krakjoe commented 9 years ago

Agree, do it if you have the time, or else I'll get to it when I come back to ustring ...

hikari-no-yume commented 9 years ago

Hmm. In JavaScript, codePointAt returns a number, not a one-char string. Maybe we should have it return a number here, since for getting a one-char string there's always []?

Also, I'm not sure it's incorrect to refer to "Unicode characters" here, but it might be confusing (for example, if "á" in decomposed form counts as two "characters").

mathiasbynens commented 9 years ago

The name charAt makes me expect a string (or a UString of course) containing the actual character, while codePointAt makes me expect a numeric code point value. That’s how it works in JavaScript.

I definitely agree that if we’re gonna have charAt (which is already available in UString) we should add codePointAt as well.
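The distinction being proposed can be sketched as follows. This is Python, not the UString API; `char_at` and `code_point_at` are hypothetical names chosen to mirror the methods under discussion. One difference worth noting: JavaScript's `codePointAt` takes an index in UTF-16 code units, whereas the index here counts code points.

```python
def char_at(text: str, index: int) -> str:
    """Return a string holding the single code point at index."""
    return text[index]


def code_point_at(text: str, index: int) -> int:
    """Return the numeric value of the code point at index."""
    return ord(text[index])


# U+1F600 lies outside the Basic Multilingual Plane.
sample = "a\U0001F600b"

print(char_at(sample, 1) == "\U0001F600")   # True
print(hex(code_point_at(sample, 1)))        # 0x1f600
```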

hikari-no-yume commented 9 years ago

Yeah, that makes sense. But what I'm wondering about more generally is whether the word "character" is confusing, as I think people might expect that to mean visual graphemes/glyphs and not codepoints.

For example, is describing $ustring->length as returning the "number of Unicode code points" better than describing it as returning the "number of characters"?
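As a rough illustration of why the wording matters, the sketch below (Python, not UString; `code_point_and_byte_lengths` is a hypothetical helper) shows two strings that both render as "á" yet have different lengths in code points and in UTF-8 bytes.

```python
def code_point_and_byte_lengths(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE

print(code_point_and_byte_lengths(decomposed))    # (2, 3)
print(code_point_and_byte_lengths(precomposed))   # (1, 2)
```

Neither count is "the number of characters" a reader would see, which is one argument for documenting the length as a count of code points.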

mathiasbynens commented 9 years ago

> For example, is describing $ustring->length as returning the "number of Unicode code points" better than describing it as returning the "number of characters"?

Ah, yes, that’s definitely clearer.

AFAIK there is no official term in the Unicode standard that means “character that corresponds to a single code point”. I use the term “symbol” for it, but that’s just me.

hikari-no-yume commented 9 years ago

Yeah, I was wondering if there was such a term, but there doesn't seem to be. "Code points" sort of works for that, but really a code point is just the number assigned by Unicode. I know that Go uses "rune" as an alias of "code point": http://blog.golang.org/strings#TOC_5.

hikari-no-yume commented 9 years ago

So perhaps UString should say something like this in its docs: "A UString represents a Unicode string, that is, a sequence of Unicode code points. Code points do not necessarily map 1:1 to visual characters. UString methods operate on and in terms of Unicode code points, rather than characters, except where otherwise noted."?

hikari-no-yume commented 8 years ago

Starting to think our current API isn't very useful. It doesn't really handle Unicode much better than passing around UTF-8 strings, because there's no real consideration for the difference between characters and Unicode scalar values, Unicode equivalence, and so on. Our API won't let you split strings on actual character boundaries, compare canonically equivalent strings, count the number of actual characters in a string, etc. It does let you work with codepoints, but that is barely better than bytes.

I think we should rethink things along the lines of Swift's API: https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
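Two of the missing operations listed above, comparing canonically equivalent strings and counting user-visible characters, can be sketched with Python's standard `unicodedata` module. This is not a proposal for the UString API: both function names are hypothetical, and the character count is only an approximation that skips code points with a nonzero canonical combining class rather than applying full grapheme cluster rules.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def approx_character_count(text: str) -> int:
    """Count code points that are not combining marks (an approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "cafe\u0301"   # "e" plus U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(decomposed == precomposed)                   # False
print(canonically_equal(decomposed, precomposed))  # True
print(len(decomposed))                             # 5
print(approx_character_count(decomposed))          # 4
```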