Better handling for text

ricardoboss / STEP

The STEP programming language

https://ricardoboss.de/STEP/

MIT License

Better handling for text #92

Closed ricardoboss closed 1 year ago

ricardoboss commented 1 year ago

I stumbled upon this article, which discusses how strings are represented using encoding, code points and graphemes: https://tonsky.me/blog/unicode/

To make STEP more accessible, I think we should change how certain functions handle string values so that they work with graphemes instead of bytes.

The functions that need to respect graphemes include:

length
substring
indexOf
contains
startsWith
endsWith
compareTo
reversed
index operator str[idx]
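
To illustrate what grapheme-aware length and str[idx] could mean, here is a minimal Python sketch (Python rather than STEP or C#, purely for illustration). The helper name grapheme_clusters is hypothetical, and the segmentation is deliberately simplified: it only attaches combining marks to the preceding character and does not implement the full Unicode segmentation rules (no ZWJ emoji sequences, flags, or Hangul jamo).

```python
import unicodedata


def grapheme_clusters(s: str) -> list[str]:
    """Simplified segmentation (assumption: combining marks only).

    Each combining mark is appended to the cluster of the character
    before it. This is NOT a full UAX #29 implementation.
    """
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


word = "cafe\u0301"  # "café" written as e + COMBINING ACUTE ACCENT
clusters = grapheme_clusters(word)

print(len(word))      # 5 code points
print(len(clusters))  # 4 user-perceived characters
print(clusters[3])    # the last cluster: "e" plus its accent
```

A real implementation would probably rely on a proper segmentation library instead of this loop, but the difference between the two lengths is the point of the issue.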

chucker commented 1 year ago

For these:

contains
startsWith
endsWith
filtered
sorted?
compareTo

you probably also want to normalize first. C# example: https://sharplab.io/#v2:C4LgTgrgdgNAJiA1AHwAICYCMBYAUH1AZgAINiBhPAbz2LtPXVvpt3vdMwAZiBDYgLzEARAEvhAbmYdOPAEaCRAU0CAwJOkcN7VJgCcACl4BKKWxmcDck1vo6D/AUKunzFwwDoAcgHswAW14AGwBLAC8lfR13ABUlAA9gL18AkNDeYGDvKAAxZPdc/wARI0EBG1cK+jkk/yCwiKjYhJqUsPTMnLyCv2LrM0qB+g0AXzxhoA=

(What's going on here? This:

    string text = "é";
    string text2 = "e\u0301";

The first string is a single precomposed letter (U+00E9). The second one is the letter e followed by a combining diacritic (U+0301): two code points that form one grapheme cluster and render identically to the first string.)
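
For readers who don't want to open the link, here is a rough Python equivalent of the idea (not the linked C# code). The helper names normalized_equals and normalized_starts_with are made up for this sketch; the only library call used is unicodedata.normalize from the Python standard library.

```python
import unicodedata

text = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE (precomposed)
text2 = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def normalized_equals(a: str, b: str, form: str = "NFD") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def normalized_starts_with(haystack: str, prefix: str, form: str = "NFD") -> bool:
    """Hypothetical startsWith that normalizes both operands first."""
    return unicodedata.normalize(form, haystack).startswith(
        unicodedata.normalize(form, prefix)
    )


print(text == text2)                                # False: different code points
print(normalized_equals(text, text2))               # True after normalization
print(normalized_starts_with("\u00e9cole", text2))  # True after normalization
```

Note that normalizing only makes the two spellings comparable. A prefix check on code points can still split a cluster (for example, a plain "e" is a code point prefix of the decomposed "é"), so grapheme handling is still needed on top.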

ricardoboss commented 1 year ago

Good catch, @chucker!

ricardoboss commented 1 year ago

@chucker any specific reason to use NormalizationForm.FormD?

ricardoboss commented 1 year ago

From the docs at https://learn.microsoft.com/en-us/dotnet/api/system.text.normalizationform?view=net-7.0, I think I'd agree that FormD might be best, because it doesn't drop formatting information and doesn't perform any unnecessary replacements:

[From FormC:] Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.

Unicode defines two types of decompositions: compatibility decomposition [FormKD] and canonical decomposition [FormD]. In compatibility decomposition, formatting information might be lost. In canonical decomposition, which is a subset of compatibility decomposition, formatting information is preserved.
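
A small Python sketch of the difference the docs describe, using the standard unicodedata module, where "NFD" corresponds to FormD and "NFKD" to FormKD. The sample string is my own choice for illustration.

```python
import unicodedata

# "x" + SUPERSCRIPT TWO, a space, then the LATIN SMALL LIGATURE FI
sample = "x\u00b2 \ufb01"

canonical = unicodedata.normalize("NFD", sample)       # canonical decomposition
compatibility = unicodedata.normalize("NFKD", sample)  # compatibility decomposition

print(canonical == sample)  # True: superscript and ligature are preserved
print(compatibility)        # "x2 fi": the formatting information is gone
```

This is why a compatibility form would be a poor default for string functions: "x²" and "x2" would start comparing as equal.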

chucker commented 1 year ago

I would argue that D is the “nicest” and most forward-looking form. In my example above, D will decompose the precomposed letter into two separate code points (the base Latin letter and its accent), whereas C will do the opposite. In other words, with D, text becomes text2 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), whereas with C, text2 becomes text (LATIN SMALL LETTER E WITH ACUTE).
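
The two directions can be checked with a few lines of Python (again only an illustration, using the standard unicodedata module; code_point_names is a hypothetical helper for printing):

```python
import unicodedata

text = "\u00e9"    # precomposed
text2 = "e\u0301"  # base letter + combining accent


def code_point_names(s: str) -> list[str]:
    """Return the Unicode name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]


decomposed = unicodedata.normalize("NFD", text)  # D: text becomes text2
composed = unicodedata.normalize("NFC", text2)   # C: text2 becomes text

print(code_point_names(decomposed))
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
print(code_point_names(composed))
# ['LATIN SMALL LETTER E WITH ACUTE']
```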

But it's mostly a matter of preference.

chucker commented 1 year ago

This also affects reversed. A grapheme cluster will not reverse correctly if you reverse at the code point level.
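
A minimal Python sketch of the problem, assuming a simplified notion of a cluster (a base character plus any combining marks that follow it, which is not full Unicode segmentation). Both function names are hypothetical.

```python
import unicodedata


def reverse_code_points(s: str) -> str:
    """Naive reversal: combining marks end up before their base letter."""
    return s[::-1]


def reverse_graphemes(s: str) -> str:
    """Reverse cluster by cluster (simplified: combining marks only)."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # keep the mark with its base letter
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "cafe\u0301"  # "café" in decomposed form

print(reverse_code_points(word))  # the accent comes first, detached from "e"
print(reverse_graphemes(word))    # "é" stays intact at the front
```

Normalizing to C first would hide the issue for this particular word, but not for clusters that have no precomposed form.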

ricardoboss commented 1 year ago

reversed currently doesn't support strings, funnily enough. I'll add string support, including grapheme handling.