microsoft / TypeScript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
https://www.typescriptlang.org
Apache License 2.0
101.04k stars 12.49k forks source link

Capitalize does not work for characters which are 2 utf-16 code units #55677

Open CraigMacomber opened 1 year ago

CraigMacomber commented 1 year ago

🔎 Search Terms

Capitalize, utf-16, string, type

🕗 Version & Regression Information

⏯ Playground Link

https://www.typescriptlang.org/play?#code/PQKhCgAIUhhBDADgSwC7wDbIF4FNLyQDOqATsgHYDmAdFCMOAGYCuFAxqsgPYWTtI0mHLgA8AZUi4AHqlwUAJkWJlKVAHwAKIgC5I4gJR6EKdFjwT1kAN7gAkMGCQAKtwIA3bsgXFEWVFzU-AAW8KTwnLikygDuwcjswQSk+AC28Ar43EyQqSwYXH74AKrOAGIAtACMAGz83JmQbGhEADT2jk1E+GhR8Fy8kJQkuBmQ2UOKMmqQqG7diGH9+KjB+EzI0aghS5GkdHbsvCRDcuFyPgC8xADa4gCeqQBG3Bg0vefcpAC6mgY0FBkqD+AG57MgcpoPssFDQFLx8JckbNSCxcAYbPYHE4AKKpRCoe4qchBATdA52FKoFikPgAIjpBGUJiE5jE4nUYLsAF9wPYqTS+FCzjCaO5MGiaHNiohEFEEN0-pAANTEGhELDsXDCvoXMUS3A0DDyKirAwY+DMwRmESWMG88BHCgnBT9eB6G4kElUVrEtTfG7fSDXG5QSDhm4M30M77tcMRunwOnRgCCdNjYYT8CTqaTGfjkE6ADleBVLexkMhM5BIwBH6TJyB0gCLDfz8c6ABluDEomT8MaAlFZqFtshlAAmSClSq1eqNZqoXTVyOALg3AMK7jbpq8AhLvpuPxtfrnNN3d59rfMFML6QTROk43SiIFioX0yOWRBRBiau9AY2wFveryGhg3BUJoAAGT4vnodIACTWNBqDco2KREPkqCwQhAimMIeBQhQz6oAYKFvtIH4XFh1jvrgn4oRBBj2ny4CEnKLi4CQogAJKES+UiyPISh+tQvo4uRtEXPxciKFauFstxvGoOoVjXCyNoWDxRGcsxrH4AA+lUwbsZxUZNnS2m6ZAelTtczgcagoiJluaYWfcbF6QAzEZdkmdmzlJq57kACzefZjn1lurbmWC4CdM4azdPwlocZA8IUAA5NsMRfAA1nQll6QArKFJkbluu7RSxbn6XUtlhdux7lTuAUgkAA

💻 Code

/**
 * Capitalize a string.
 */
function capitalize<S extends string>(s: S): Capitalize<S> {
    // To avoid splitting characters which are made of multiple UTF-16 code units,
    // use iteration instead of indexing to separate the first character.
    const iterated = s[Symbol.iterator]().next();
    if (iterated.done === true) {
        // Empty string case.
        return "" as Capitalize<S>;
    }

    return (iterated.value.toUpperCase() + s.slice(iterated.value.length)) as Capitalize<S>;
}

const data: [string, string][] = [
    ["", ""],
    ["a", "A"],
    ["aa", "Aa"],
    // Non-ascii
    ["ñx", "Ñx"],
    // Lowercase letter that is 2 UTF-16 code units:
    ["𞥃", "𞤡"],
    ["𞥃a", "𞤡a"],
];
for (const [input, expected] of data) {
    console.log(`input: "${input}", result: "${capitalize(input)}", expected: "${expected}"`);
}

type Test<Input extends string, Expected extends Capitalize<Input>> = Capitalize<Input>;

type _1 = Test<"", "">;
type _2 = Test<"a", "A">;
type _3 = Test<"aa", "Aa">;
type _4 = Test<"ñx", "Ñx">;

// These cases don't work.
type _5 = Test<"𞥃", "𞤡">;
type _6 = Test<"𞥃a", "𞤡a">;

🙁 Actual behavior

Capitalize appears to capitalize theString[0] which is not a correct implementation due to mishandling of characters which require two utf-16 code units and thus take up two indexes in the string.

For example, Capitalize<"𞥃"> gives "𞥃";

🙂 Expected behavior

When the first character in the string passed to Capitalize is in the Unicode "Lowercase Letter" category, that character should be replaced with its corresponding uppercase version.

For example, Capitalize<"𞥃"> should give"𞤡";

Additional information about the issue

The included code provides a tested correct implementation for runtime use. This implementation is taken from my work here: https://github.com/microsoft/FluidFramework/blob/ddee06b429b26b571b7383d4acd95a62d3b9758c/experimental/dds/tree2/src/util/utils.ts#L394

That code is available under the MIT license and already copyright Microsoft, so feel free to use it in the TypeScript implementation if you find it useful to do so.

This will be a breaking change, but I highly doubt it will impact real use.

RyanCavanaugh commented 1 year ago

If we change this, we should revamp all string handling to be character-based instead of index-based for consistency. e.g. this today is also wrong by the same token:

type A = "𞥃";
type First<T extends string> = T extends `${infer F}${string}` ? F : never;
type F = Uppercase<First<A>>;

I'd like to understand more how you encountered this in real code

CraigMacomber commented 1 year ago

I'd like to understand more how you encountered this in real code

I'm working on a system which generates APIs based on schema (similar to what TypeBox does, but more specialized), I wanted to add methods with named derived from the names in the schema (like convert "foo" into "boxedFoo") since I know TypeScript can do this now (which is pretty neat and works surprisingly well).

There are two parts to this: the Types part using Capitalize ( https://github.com/microsoft/FluidFramework/blob/ddee06b429b26b571b7383d4acd95a62d3b9758c/experimental/dds/tree2/src/feature-libraries/editable-tree-2/editableTreeTypes.ts#L227 ) and the runtime part which I had to implement my own capitalize function for ( https://github.com/microsoft/FluidFramework/blob/ddee06b429b26b571b7383d4acd95a62d3b9758c/experimental/dds/tree2/src/feature-libraries/editable-tree-2/lazyTree.ts#L436 )

Since the field names in the schema come from users of our library, and I'm personally very aware of JS utf-16 string issues, I hoped the standard library would have a capitalize operation so I wouldn't have to deal with this. Searching for that lead me to https://stackoverflow.com/questions/1026069/how-do-i-make-the-first-letter-of-a-string-uppercase-in-javascript which I immediately realized was incorrect. I then wrote my own implementation without the bug (including tests covering that case), and wondered if possibly TypeScript had the same bug in their version. I checked and found that it did.

Since my implementation is correct (as far as I can tell anyway), and does not match TypeScript's behavior, a user of our Schema system that started a field name with a lowercase character that is two utf-16 code points would get incorrect type information for the generated API. As our library is still in the alpha stage and all our schema are in English, we absolutely haven't hit this, but its possible that we could some day.