numbas / Numbas

A completely browser-based e-assessment/e-learning system, with an emphasis on mathematics
http://www.numbas.org.uk
Apache License 2.0

Should len(string) return the number of graphemes, or the number of codepoints? #967

Open christianp opened 1 year ago

christianp commented 1 year ago

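In Unicode, a grapheme might be represented by a sequence of several codepoints. For example, "é" can be written as the two codepoints U+0065 U+0301. JavaScript strings add a further layer, because they count UTF-16 code units: the emoji 🫶 is the single codepoint U+1FAF6, but is stored as two code units: \ud83e\udef6.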

Should the length of a string in JME count graphemes or codepoints? I think the least-surprising answer from a human's perspective is graphemes, but that means that all the methods for indexing and slicing strings need to be grapheme-aware.
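
To make the difference concrete, here is a minimal sketch in plain JavaScript (not JME) that compares the built-in code-unit count with a codepoint count. The helper names `codeUnitLength` and `codepointLength` are made up for illustration.

```javascript
// Illustrative helpers only; these names are not part of Numbas or JME.
const heartHands = "\ud83e\udef6"; // 🫶, the single codepoint U+1FAF6
const eAcute = "e\u0301";          // "é" written as "e" + combining acute accent

// String.prototype.length counts UTF-16 code units.
const codeUnitLength = (s) => s.length;

// String iteration walks codepoints, so spreading the string counts codepoints.
const codepointLength = (s) => [...s].length;

console.log(codeUnitLength(heartHands), codepointLength(heartHands)); // 2 1
console.log(codeUnitLength(eAcute), codepointLength(eAcute));         // 2 2
```

A human would say each of these strings is one character long, and neither count gives 1 for both.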

christianp commented 1 year ago

Blog posts on how this is dealt with in different languages:

Libraries to deal with grapheme clusters:

There is a proposal to add an Intl.Segmenter interface to JS to deal with this.
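
As a rough sketch of what grapheme-aware length and slicing could look like, assuming a runtime where `Intl.Segmenter` is available: the helpers `graphemes`, `graphemeLength` and `graphemeSlice` below are hypothetical names, not existing Numbas or JME functions.

```javascript
// Hypothetical helpers, not part of Numbas or JME.
// Assumes Intl.Segmenter exists in the runtime.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split a string into an array of grapheme clusters.
function graphemes(s) {
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

// Length as a human would count it.
function graphemeLength(s) {
  return graphemes(s).length;
}

// Slice by grapheme index rather than by code unit.
function graphemeSlice(s, start, end) {
  return graphemes(s).slice(start, end).join("");
}

console.log(graphemeLength("e\u0301"));       // 1
console.log(graphemeLength("\ud83e\udef6"));  // 1
console.log(graphemeSlice("ae\u0301b", 1, 2) === "e\u0301"); // true
```

This version segments the whole string on every call, so indexing is linear in the string length. Whether that cost matters for JME is an open question.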