ricardoboss closed 1 year ago
For these: contains, startsWith, endsWith, filtered, sorted?, compareTo, you probably also want to normalize first. C# example: https://sharplab.io/#v2:C4LgTgrgdgNAJiA1AHwAICYCMBYAUH1AZgAINiBhPAbz2LtPXVvpt3vdMwAZiBDYgLzEARAEvhAbmYdOPAEaCRAU0CAwJOkcN7VJgCcACl4BKKWxmcDck1vo6D/AUKunzFwwDoAcgHswAW14AGwBLAC8lfR13ABUlAA9gL18AkNDeYGDvKAAxZPdc/wARI0EBG1cK+jkk/yCwiKjYhJqUsPTMnLyCv2LrM0qB+g0AXzxhoA=
(What's going on here? This:
string text = "é";
string text2 = "e\u0301";
The first string is a pre-combined letter. The second one is a grapheme cluster consisting of the letter, then a combining diacritic.)
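As a rough illustration of "normalize first" (a sketch, not the code behind the SharpLab link): the helper name NormalizedContains is hypothetical, FormD is just one possible choice of form, and the Contains overload that takes a StringComparison assumes a modern .NET runtime.

```csharp
using System;
using System.Text;

string text = "\u00E9";   // "é" as one precomposed code point
string text2 = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT

// Hypothetical helper: bring both operands to the same normalization form,
// then compare ordinally so the result does not depend on the current culture.
static bool NormalizedContains(string haystack, string needle) =>
    haystack.Normalize(NormalizationForm.FormD)
        .Contains(needle.Normalize(NormalizationForm.FormD), StringComparison.Ordinal);

// Without normalization the two spellings of "é" do not match.
Console.WriteLine(text == text2);                                            // False
Console.WriteLine(("caf" + text).Contains(text2, StringComparison.Ordinal)); // False

// After normalizing both sides they do.
Console.WriteLine(NormalizedContains("caf" + text, text2));                  // True
```

The same pattern (normalize both operands, then call the ordinal operation) would apply to startsWith, endsWith and compareTo.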
Good catch, @chucker!
@chucker any specific reason to use NormalizationForm.FormD?
From the docs at https://learn.microsoft.com/en-us/dotnet/api/system.text.normalizationform?view=net-7.0, I think I'd agree that FormD might be best, because it doesn't drop formatting information and doesn't perform any unnecessary replacements:
[From FormC:] Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.
Unicode defines two types of decompositions: compatibility decomposition [FormKD] and canonical decomposition [FormD]. In compatibility decomposition, formatting information might be lost. In canonical decomposition, which is a subset of compatibility decomposition, formatting information is preserved.
I would argue that D is the “nicest” and most forward-looking form. In my example above, D will decompose the precomposed letter into the base Latin letter and its accent as two separate characters, whereas C will do the opposite. IOW, with D, text becomes text2 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), whereas with C, text2 becomes text (LATIN SMALL LETTER E WITH ACUTE).
But it's mostly a matter of preference.
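To see the two forms side by side, here is a small sketch using the text and text2 values from the example above. CodeUnits is a hypothetical helper that only formats UTF-16 code units for display.

```csharp
using System;
using System.Linq;
using System.Text;

string text = "\u00E9";   // LATIN SMALL LETTER E WITH ACUTE
string text2 = "e\u0301"; // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

// Hypothetical display helper: format each UTF-16 code unit as U+XXXX.
static string CodeUnits(string s) =>
    string.Join(" ", s.Select(ch => $"U+{(int)ch:X4}"));

string decomposed = text.Normalize(NormalizationForm.FormD);
string composed = text2.Normalize(NormalizationForm.FormC);

Console.WriteLine(CodeUnits(decomposed)); // U+0065 U+0301
Console.WriteLine(CodeUnits(composed));   // U+00E9
Console.WriteLine(decomposed == text2);   // True
Console.WriteLine(composed == text);      // True
```

Either form works for comparisons as long as both operands use the same one.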
This also affects reversed. A grapheme cluster will not reverse correctly if you reverse at the code point level.
reversed currently doesn't support strings, funnily enough. I'll add it, including grapheme support.
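A minimal sketch of the difference, assuming .NET's StringInfo text elements are an acceptable stand-in for grapheme clusters (on .NET 5 and later they follow extended grapheme clusters). Both function names are hypothetical and not part of STEP.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Naive approach: reverse UTF-16 code units. Combining marks end up
// attached to the wrong base letter.
static string ReverseCodeUnits(string s)
{
    char[] chars = s.ToCharArray();
    Array.Reverse(chars);
    return new string(chars);
}

// Sketch of a grapheme-aware reverse: split into text elements,
// reverse their order, and join them back together.
static string ReverseGraphemes(string s)
{
    var elements = new List<string>();
    TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext())
        elements.Add(e.GetTextElement());
    elements.Reverse();
    return string.Concat(elements);
}

string input = "ae\u0301b"; // "aéb" with a decomposed é

Console.WriteLine(ReverseCodeUnits(input) == "b\u0301ea"); // True: the accent moved onto "b"
Console.WriteLine(ReverseGraphemes(input) == "be\u0301a"); // True: the accent stays on "e"
```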
I stumbled upon this article, which discusses how strings are represented using encoding, code points and graphemes: https://tonsky.me/blog/unicode/
To make STEP more accessible, I think we should change how certain functions handle string values, to work with graphemes instead of bytes.
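To illustrate why the unit of counting matters, the following sketch measures the same one-grapheme string in four ways. It assumes .NET Core 3.0 or later for EnumerateRunes. CountRunes is a hypothetical helper.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: count Unicode scalar values (code points).
static int CountRunes(string value)
{
    int n = 0;
    foreach (Rune rune in value.EnumerateRunes())
        n++;
    return n;
}

string sample = "e\u0301"; // rendered as a single "é"

Console.WriteLine(Encoding.UTF8.GetByteCount(sample));       // 3 UTF-8 bytes
Console.WriteLine(sample.Length);                            // 2 UTF-16 code units
Console.WriteLine(CountRunes(sample));                       // 2 code points
Console.WriteLine(new StringInfo(sample).LengthInTextElements); // 1 grapheme
```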
The functions that need to respect graphemes include:
length
substring
indexOf
contains
startsWith
endsWith
compareTo
reversed
str[idx]
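As a sketch of what grapheme-aware length, substring and str[idx] could look like on top of .NET's StringInfo: the helper names below are hypothetical and say nothing about how STEP actually implements these functions.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers built on StringInfo text elements.
static int GraphemeLength(string value) =>
    new StringInfo(value).LengthInTextElements;

static string GraphemeSubstring(string value, int start, int length) =>
    new StringInfo(value).SubstringByTextElements(start, length);

static string GraphemeAt(string value, int index) =>
    new StringInfo(value).SubstringByTextElements(index, 1);

string word = "cafe\u0301s"; // "cafés" with a decomposed é

Console.WriteLine(word.Length);                                 // 6 UTF-16 code units
Console.WriteLine(GraphemeLength(word));                        // 5 graphemes
Console.WriteLine(GraphemeAt(word, 3) == "e\u0301");            // True
Console.WriteLine(GraphemeSubstring(word, 3, 2) == "e\u0301s"); // True
```

Indexing by code unit instead (word[3]) would return a bare "e" and leave the accent behind, which is the kind of surprise the list above is meant to avoid.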