Should len(string) return the number of graphemes, or the number of codepoints?

numbas / Numbas

A completely browser-based e-assessment/e-learning system, with an emphasis on mathematics

http://www.numbas.org.uk

Apache License 2.0

Should len(string) return the number of graphemes, or the number of codepoints? #967

Open christianp opened 1 year ago

christianp commented 1 year ago

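In Unicode, a single grapheme can be represented by a sequence of several codepoints. For example, the emoji 🫶 with a skin tone modifier, 🫶🏽, is two codepoints: U+1FAF6 followed by U+1F3FD. Even the plain 🫶 is a single codepoint, but JavaScript stores it as two UTF-16 code units: \ud83e\udef6.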
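
To make the units concrete, here is a minimal sketch in plain JavaScript. It relies only on built-in string behaviour: `length` counts UTF-16 code units, while the string iterator walks codepoints. The helper names `codeUnits` and `codePoints` are made up for illustration.

```javascript
// U+1FAF6 HEART HANDS, written as an escape so the source stays ASCII.
const heartHands = "\u{1FAF6}";
// The same emoji followed by the skin tone modifier U+1F3FD.
const withSkinTone = "\u{1FAF6}\u{1F3FD}";

// String.prototype.length counts UTF-16 code units.
const codeUnits = (s) => s.length;
// Spreading a string iterates by codepoint, so this counts codepoints.
const codePoints = (s) => [...s].length;

console.log(codeUnits(heartHands), codePoints(heartHands));     // 2 1
console.log(codeUnits(withSkinTone), codePoints(withSkinTone)); // 4 2
```

Neither count matches what a reader sees, which is one symbol in both cases.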

Should the length of a string in JME count graphemes or codepoints? I think the least-surprising answer from a human's perspective is graphemes, but that means that all the methods for indexing and slicing strings need to be grapheme-aware.
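
As a rough sketch of what "grapheme-aware" would mean for length, indexing and slicing, the following uses the built-in `Intl.Segmenter` where the runtime provides it. The helpers `graphemes`, `graphemeLength`, `graphemeAt` and `graphemeSlice` are hypothetical names, not part of Numbas or JME, and this is not how JME currently behaves.

```javascript
// Hypothetical helpers for illustration only; assumes Intl.Segmenter exists.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split a string into an array of grapheme clusters.
function graphemes(s) {
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

function graphemeLength(s) {
  return graphemes(s).length;
}

function graphemeAt(s, i) {
  return graphemes(s)[i];
}

function graphemeSlice(s, start, end) {
  return graphemes(s).slice(start, end).join("");
}

// "a", heart hands with a skin tone modifier, "b"
const s = "a\u{1FAF6}\u{1F3FD}b";
console.log(s.length);          // 6 UTF-16 code units
console.log([...s].length);     // 4 codepoints
console.log(graphemeLength(s)); // 3 graphemes
console.log(graphemeSlice(s, 1, 2) === "\u{1FAF6}\u{1F3FD}"); // true
```

Each helper re-segments the whole string, so indexing costs O(n) rather than O(1). A real implementation would probably want to cache the segmentation.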

christianp commented 1 year ago

Blog posts on how this is dealt with in different languages:

Libraries to deal with grapheme clusters:

Python - https://pypi.org/project/grapheme/
JavaScript - https://github.com/orling/grapheme-splitter

There is a proposal to add an Intl.Segmenter interface to JS to deal with this.
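
Since `Intl.Segmenter` may not be available in every runtime, one option is to feature-detect it and fall back to counting codepoints. This is only a sketch of that idea: `stringLength` is a hypothetical name, and the fallback gives a different answer for multi-codepoint graphemes.

```javascript
// Hypothetical helper: count graphemes where possible, codepoints otherwise.
function stringLength(s) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    let count = 0;
    // Each iteration yields one grapheme cluster.
    for (const _segment of segmenter.segment(s)) {
      count += 1;
    }
    return count;
  }
  // Fallback: the string iterator yields codepoints, not graphemes.
  return [...s].length;
}

console.log(stringLength("abc")); // 3 under either branch
```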