stdlib-js / stdlib

✨ Standard library for JavaScript and Node.js. ✨
https://stdlib.io
Apache License 2.0

Tracking issue: refactor string packages handling grapheme clusters in terms of "base" packages #1062

Open kgryte opened 12 months ago

kgryte commented 12 months ago

The purpose of this issue is to track tasks related to the effort to refactor string packages handling grapheme clusters to use "base" packages which handle more specialized use cases.

Overview

String packages, such as @stdlib/string/first, have several possible "modes" of operation. When getting the first character, a straightforward approach would use indexing. E.g.,

var str = 'Hello, World!';

var ch = str[ 0 ];
// returns 'H'

This works according to user expectation so long as the first character is a relatively common character which can be stored in a single UTF-16 code unit. However, indexing fails to match user intuition when the first visual character is composed of multiple code units.
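
To make the failure case concrete, here is a minimal sketch using only built-in string indexing. The example string is an assumption chosen for illustration: it begins with an emoji (U+1F600) which lies outside the Basic Multilingual Plane and is thus stored as two UTF-16 code units.

```javascript
// U+1F600 is stored as a surrogate pair (two UTF-16 code units):
var str = '\uD83D\uDE00 Hello';

var ch = str[ 0 ];
// returns '\uD83D' (a lone high surrogate, not the emoji)

var len = str.length;
// returns 8 (two code units for the emoji, one space, five letters)
```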

As such, one has three options for resolving the first character: by UTF-16 code unit, by Unicode code point, or by grapheme cluster.

The most robust approach for matching user intuition is to resolve grapheme clusters (i.e., user-perceived visual characters), especially for text which may include emojis with skin tones and modified characteristics. However, resolving grapheme clusters is comparatively slow and may lead to unacceptable performance issues, especially when working with simple text.
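
As a rough illustration of how the three levels of resolution differ, the sketch below resolves the "first character" of the same string at each level. It relies only on JavaScript built-ins; in particular, `Intl.Segmenter` is used as a stand-in for grapheme cluster resolution and is not the stdlib implementation. The example string is an assumption chosen for illustration.

```javascript
// Emoji sequence: woman + zero-width joiner + girl (three code points, five code units), followed by '!':
var str = '\uD83D\uDC69\u200D\uD83D\uDC67!';

// 1. Code unit: plain indexing.
var unit = str[ 0 ];
// returns '\uD83D' (a lone high surrogate)

// 2. Code point: the built-in string iterator steps by code point.
var point = Array.from( str )[ 0 ];
// returns '\uD83D\uDC69' (the woman emoji only)

// 3. Grapheme cluster: `Intl.Segmenter` (a built-in, used here as a stand-in).
var seg = new Intl.Segmenter( 'en', { 'granularity': 'grapheme' } );
var cluster = seg.segment( str )[ Symbol.iterator ]().next().value.segment;
// returns '\uD83D\uDC69\u200D\uD83D\uDC67' (the full emoji sequence)
```

Only the third result matches what a user would perceive as the first character, and it requires running a segmentation algorithm over the input, whereas the first is a constant-time lookup.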

Solution

Rather than provide a single API which only processes text as a sequence of grapheme clusters, the proposed solution is to refactor top-level @stdlib/string/* packages which handle grapheme clusters to support different "modes" of operation, whereby a user can choose which type of processing is most appropriate for given input strings.

Internally, packages supporting different modes should rely on separate, specialized "base" packages (@stdlib/string/base/*) which implement appropriate algorithms for resolving code units, code points, and grapheme clusters, respectively.
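
One possible shape for such a package is sketched below. Everything here is hypothetical and for illustration only: the function names, the mode names (`'code_unit'`, `'code_point'`, `'grapheme'`), and the choice of `'grapheme'` as the default are assumptions, and the three helper functions are simple stand-ins built on JavaScript built-ins rather than the actual `@stdlib/string/base/*` implementations.

```javascript
// Hypothetical stand-ins for specialized "base" implementations (no argument validation):
function firstCodeUnit( str ) {
	return str.substring( 0, 1 );
}

function firstCodePoint( str ) {
	// The built-in string iterator steps by code point:
	var it = str[ Symbol.iterator ]().next();
	return ( it.done ) ? '' : it.value;
}

function firstGraphemeCluster( str ) {
	// `Intl.Segmenter` is a built-in used here in place of a dedicated algorithm:
	var seg = new Intl.Segmenter( undefined, { 'granularity': 'grapheme' } );
	var it = seg.segment( str )[ Symbol.iterator ]().next();
	return ( it.done ) ? '' : it.value.segment;
}

// Assumed mode names mapped to base implementations:
var MODES = {
	'code_unit': firstCodeUnit,
	'code_point': firstCodePoint,
	'grapheme': firstGraphemeCluster
};

// Hypothetical top-level function: validates input, then delegates based on the mode.
function first( str, options ) {
	var mode = ( options && options.mode ) || 'grapheme'; // assumed default
	if ( typeof str !== 'string' ) {
		throw new TypeError( 'invalid argument. First argument must be a string.' );
	}
	if ( !Object.prototype.hasOwnProperty.call( MODES, mode ) ) {
		throw new TypeError( 'invalid option. Unrecognized mode: ' + mode );
	}
	return MODES[ mode ]( str );
}

var s = '\uD83D\uDC69\u200D\uD83D\uDC67!';

var a = first( s, { 'mode': 'code_unit' } );
// returns '\uD83D'

var b = first( s, { 'mode': 'code_point' } );
// returns '\uD83D\uDC69'

var c = first( s );
// returns '\uD83D\uDC69\u200D\uD83D\uDC67'
```

Keeping validation and option handling in the top-level function lets the base functions stay small and fast, so callers who know their input is simple text can opt into the cheaper modes or call a base function directly.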

Prior Art

For examples of refactorings, see

Tasks

The following packages should be refactored to use the proposed solution:

The following package implementation needs to be rewritten:

Notes

In general, refactoring should happen in the following order:

  1. Create the base package processing grapheme clusters (package name should have a -grapheme-cluster or -grapheme-clusters suffix). This is often similar to the top-level package, but stripped of input argument validation and optional arguments.
  2. Create the base package for processing Unicode code points (package name should have a -code-point or -code-points suffix).
  3. Create the base package for processing UTF-16 code units (if necessary, package name should have a -code-unit or -code-units suffix).
  4. Refactor the top-level package to depend on the base packages and add support for specifying a mode option.

steff456 commented 10 months ago

We also need to fix the implementation of prev-grapheme-cluster, because its results are not consistent with those of next-grapheme-cluster. Beyond fixing this bug, I believe we can also create a more performant implementation for this package: use the package that returns the complete number of grapheme clusters, then traverse the string from right to left, aggregating from the given index and checking that the aggregated result still forms a grapheme cluster as a whole.