qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Split a string by graphemes #73

Closed rhdunn closed 3 months ago

rhdunn commented 3 years ago

The new fn:characters function is useful, but it does not solve the problem of manipulating strings in which multiple codepoints correspond to a single grapheme. For example:

  1. characters with one or more combining characters;
  2. emoji with skin tone modifiers;
  3. emoji with gender variants;
  4. multi-sequence emoji -- family, Wales flag, etc.;
  5. region indicator pairs for flags.

Getting this right is complex, and a regular-expression implementation is easy to get wrong.
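As a rough illustration of the problem, the following Python sketch splits a few of the cases above by code point. Python is used purely for illustration; the assumption is that iterating a Python str by code point mirrors what fn:characters does. The helper name code_point_split is hypothetical.

```python
# Splitting by code point: each user-perceived character below falls apart
# into several single-code-point strings.
samples = {
    "e + combining acute": "e\u0301",
    "waving hand + skin tone": "\U0001F44B\U0001F3FB",
    "man scientist (ZWJ sequence)": "\U0001F468\u200D\U0001F52C",
    "flag of Estonia": "\U0001F1EA\U0001F1EA",
}


def code_point_split(value):
    """Return one single-code-point string per code point in value."""
    return list(value)


for label, text in samples.items():
    parts = code_point_split(text)
    print(f"{label}: {len(parts)} code points -> {[hex(ord(p)) for p in parts]}")
```

Each sample is something a reader would call one character, yet none of them survives a code-point split intact.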

fn:graphemes

Summary

Splits the supplied string into a sequence of single-grapheme strings, each consisting of one or more characters.

Signature

fn:graphemes($value as xs:string?) as xs:string*

Properties

This function is ·deterministic·, ·context-independent·, and ·focus-independent·.

Rules

The function returns a sequence of strings, each containing one ·grapheme· of $value, in the order in which they appear. The boundaries are determined by the Unicode rules for what constitutes a ·grapheme·. The versions of Unicode and of the Unicode Emoji standard that are used are ·implementation-dependent·.

If $value is a zero-length string or the empty sequence, the function returns the empty sequence.

Examples

The expression fn:graphemes("Thérèse") returns ("T", "h", "é", "r", "è", "s", "e"), irrespective of whether the accented e characters are precomposed or use combining characters.

The expression fn:graphemes("") returns ().

The expression fn:graphemes(()) returns ().

The expression fn:graphemes("👋🏻👋🏼👋🏽👋🏾👋🏿") returns ("👋🏻", "👋🏼", "👋🏽", "👋🏾", "👋🏿").

The expression fn:graphemes("👪") returns ("👪").

The expression fn:graphemes("👨‍🔬👩‍🔬") returns ("👨‍🔬", "👩‍🔬").

The expression fn:graphemes("🇪🇪🇩🇪🇫🇷🏴󠁧󠁢󠁷󠁬󠁳󠁿🇮🇸") returns ("🇪🇪", "🇩🇪", "🇫🇷", "🏴󠁧󠁢󠁷󠁬󠁳󠁿", "🇮🇸").

rhdunn commented 3 years ago

It might also be worth adding a note to fn:characters about this issue and referencing fn:graphemes for the use cases where preserving graphemes is required.

Conal-Tuohy commented 3 years ago

It would be good to spell out the Unicode blocks of the combining characters, variation selectors, etc. Some of the current XPath functions are spelled out as "equivalent of the following function: ... " and this could be doable for fn:graphemes, too, I think.

"Text" and "emoji" variation selectors would be another good example to include:

The expression fn:graphemes("♎♎︎") returns ("♎", "♎︎").
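To give a feel for what an "equivalent function" would have to cover, here is a deliberately simplified Python sketch. It is not the UAX #29 algorithm: the code point ranges are hard-coded assumptions, anything after a zero-width joiner is glued on without checking that both sides are emoji, and CR LF, Hangul syllables, prepended marks, and Indic conjuncts are ignored. The name approximate_graphemes is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner


def is_extend(ch):
    """Rough stand-in for 'attaches to the preceding character'."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, variation selectors
        or ch == ZWJ
        or 0x1F3FB <= cp <= 0x1F3FF  # emoji skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F  # tag characters (subdivision flags)
    )


def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def approximate_graphemes(value):
    """Split value into approximate grapheme clusters (simplified rules only)."""
    clusters = []
    ri_run = 0  # length of the current run of regional indicators
    for ch in value or "":
        prev = clusters[-1] if clusters else ""
        if prev and (is_extend(ch) or prev.endswith(ZWJ)):
            clusters[-1] += ch  # extend the previous cluster
        elif prev and is_regional_indicator(ch) and ri_run % 2 == 1:
            clusters[-1] += ch  # second half of a flag pair
        else:
            clusters.append(ch)  # start a new cluster
        ri_run = ri_run + 1 if is_regional_indicator(ch) else 0
    return clusters


print(approximate_graphemes("The\u0301re\u0300se"))
print(approximate_graphemes("\u264E\u264E\uFE0E"))
```

Even this toy version needs four separate special cases, which suggests that a complete "equivalent function" in the spec would be long and would still depend on Unicode property data.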

rhdunn commented 3 years ago

The http://unicode.org/reports/tr51/ document should be referenced, which details how to identify an emoji grapheme. Which version is used should be implementation dependent.

liamquin commented 3 years ago

Are you sure you mean grapheme and not grapheme cluster here?

rhdunn commented 3 years ago

Ah yes, you are right. Interestingly, Unicode supports two grapheme cluster modes (https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) -- legacy grapheme cluster, and extended grapheme cluster -- in addition to the emoji rules in TR51.

As a result of this, it may be more useful to extend fn:characters with an options map. That could have two options:

record(grapheme-cluster as enum("codepoint", "legacy", "extended"),
       emoji as xs:boolean)

If grapheme-cluster is "codepoint", it works as $value => string-to-codepoints() -> codepoints-to-string(). If it is "legacy" or "extended", it follows the corresponding Unicode algorithm in TR29.

If emoji is true, then it follows the TR51 rules for segmenting emoji sequences.

If we want the default to be a simple conversion, the options could default to map { grapheme-cluster: "codepoint", emoji: false() }, otherwise, they could default to map { grapheme-cluster: "extended", emoji: true() }.

The exact behaviour would be implementation dependent and depend on the Unicode/Emoji version supported by that implementation.

rhdunn commented 3 years ago

I based the function name on http://www.unicode.org/glossary/#grapheme, specifically:

(2) What a user thinks of as a character.

michaelhkay commented 2 years ago

I love the phrase "what a user thinks of as a character". I don't imagine many users think a pile of poo is a character. They might think that Sherlock Holmes is.

duncdrum commented 2 years ago

This is quite interesting, but we will need to ensure that graphemes in the context of CJK and Unihan return what users would expect. I'll try to come up with a few examples.

ChristianGruen commented 10 months ago

I wonder if this feature isn’t too sophisticated to be added to the spec. What do others think?

michaelhkay commented 10 months ago

Personally, this is not something I have ever felt a need for. I'm open to persuasion on that, though I'm aware that when one person enthusiastically wants a feature, and everyone else doesn't see the need for it, there is a tendency to add it, causing feature creep. But I also need convincing that specifying it, implementing it, and testing it are reasonably feasible. A few features like parse-html and parse-uri are inevitably going to be difficult, but we only want to tackle difficult issues if there's a high benefit.

Arithmeticus commented 10 months ago

Let me check with the TEI linguistic community and gauge their interest.

bansp commented 10 months ago

Regarding "people who may see the need for it", the following is a fragment of an e-mail that I have found interesting and kept aside for when I get a moment to research more. It might be relevant to the issue at hand (and I'm hoping for your patience in case it's completely immaterial), despite mentioning Java and Python -- because others would want to use purely XML-based solutions here:

(TL;DR? skip to point 2.)

  1. Indic script

I always thought of Indic script dependent vowel (maatraa) as a character, but I recently found that languages like Java and Python do not treat such written symbols as character, so when I try to get the length of an Indic-script string, the in-built string length functions give only the number of consonant symbols and independent vowels in the string. We got wrong results using these functions and I only accidentally discovered that this is the case. The reason, of course, is that these functions and programming languages treat such dependent vowels as diacritics, which is also correct in some ways. I did not realize this earlier because in India we often use a Latin script-based notation called WX for Indic scripts in NLP due to the encoding and input method related problems that I referred to in one of my earlier replies. The WX notation, however, does not distinguish between dependent and independent vowels and treats both of them as the same character, which is how most of us, if not all, think of them in India to the best of my knowledge. On the other hand, the consonant symbol modifier 'halant' is not used in WX, but is used in Indic-scripts and its presence might also cause disagreements about what the string length is. In other words, character as a unit does not work in your terms. In fact, who knows how many errors for Indic script text have made their way into computational results due to this simple fact. And perhaps they still do because it took me a long time to realize this, which at first led to consternation, because in text processing if you can't rely on the string length function, what can you rely on?

(By Anil Singh, in a message to Corpora-l)

  2. Arabic script

I would like to see if it is feasible to consistently get the same string length for the following variants of the same word (shukran, 'thanks'): شكرا and شُكْرًا. The latter example uses some diacritics (there can also be examples with more diacritics in a single grapheme; they can stack), not only for short vowels, but also for the final "n" (and also for the absence of a vowel, between "k" and "r"). And, naturally, they are bound to be produced by various methods. The result, in both cases, should be four, irrespective of how the second form is constructed.

If fn:characters were to differ from fn:graphemes in this case and/or for the Indic example (consistently), then that might indicate a benefit in keeping the latter function in.
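The difference for the Arabic example can be sketched in Python. The assumptions here are that code-point length stands in for a count over fn:characters, and that counting code points outside the combining-mark categories is an acceptable stand-in for a grapheme count on this particular word; it is not a general grapheme algorithm. Both function names are hypothetical.

```python
import unicodedata

# shukran without and with diacritics, written as escapes to avoid ambiguity
PLAIN = "\u0634\u0643\u0631\u0627"
VOCALISED = "\u0634\u064F\u0643\u0652\u0631\u064B\u0627"


def code_point_length(value):
    """Number of code points, diacritics included."""
    return len(value)


def base_character_length(value):
    """Number of code points that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in value if not unicodedata.category(ch).startswith("M"))


for word in (PLAIN, VOCALISED):
    print(code_point_length(word), base_character_length(word))
```

The code-point lengths differ (4 and 7), while the mark-insensitive count is 4 for both forms, which is the result asked for above.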

If, like me, you're not exactly eagle-eyed, you might appreciate the screenshot enlarging the squiggles: [image: enlarged screenshot of the two forms of the word]

(despite appearances, there is no whitespace before the final ا -- the whole thing is a single word)


Apologies if this is not on topic (I do hope it is, and will be curious to learn why it isn't, if it isn't -- even if by following a pointer, so thanks in advance).

Arithmeticus commented 10 months ago

Thinking about Piotr's examples, and my own in Greek and Syriac, I see in fn:graphemes clear benefits for those who work with non-Latin scripts. Easy for me to say, but the implementers should say whether it is manageable.

If the functionality is approved by the CG, I would prefer to see it as its own standalone function, and not packed as a map option into fn:characters: the name fn:characters would then be misleading, and functions with parameters that expect a map can be a hassle to use. To support the two UAX #29 rule types, fn:graphemes could be extended to arity two with the parameter $extended as xs:boolean := true(). (Crossing my fingers that Unicode doesn't introduce a third type of grapheme cluster.)

In reading UAX#29 I can't help but also suggest that we consider introducing the functions fn:words and fn:sentences. The caveats in UAX 29 would have to be iterated, but the result would provide significant utility to a very broad range of users, including the majority who work only in Latin scripts.

And XPath would finally have a function that begins with 'w'.

duncdrum commented 10 months ago

For CJK string manipulation, a Unihan-compliant fn:graphemes() would be highly useful. I'll gladly come up with or review examples (not now, I'm on my phone).

@michaelhkay @ChristianGruen I don’t think of this as too sophisticated. The lack of perceived need seems to me to be an accident of the linguistic composition of the working group.

michaelhkay commented 10 months ago

I'm wondering if splitting text into graphemes could be presented as a use-case for invisible XML?

The thing that always worries me about features like this is that the WG doesn't have the expertise to get the specification right. It's bad enough with collations -- we do exactly what UCA says, and it turns out not to meet users' needs.

liamquin commented 10 months ago

On Tue, 2023-10-31 at 12:52 -0700, Joel Kalvesmaki wrote:

In reading UAX#29 I can't help but also suggest that we consider introducing the functions fn:words and fn:sentences. The caveats in UAX 29 would have to be iterated,

In particular, it doesn't work for huge numbers of people in the world unless your implementation has a dictionary (e.g. for China, Japan, Thailand). On the other hand, it could use the same definition as regular expressions (\b \< \> \w \W in most systems).

If these various functions are added, there should be support in regular expressions too (do we have \X already? See e.g. [1]).

Sentences, Mr. Kalvesmaki, are harder :-).

[1] https://stackoverflow.com/questions/53198407/is-there-a-regular-expression-which-matches-a-single-grapheme-cluster

-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org

duncdrum commented 10 months ago

If these various functions are added, there should be support in regular expressions too (do we have \X already?

Regex support would be fantastic. I don't think we do; another syntax suggestion is \g, see https://www.unicode.org/reports/tr18/#RL2.2

ChristianGruen commented 9 months ago

No further comments here for the last 6 weeks… Do we believe someone would be ready and willing to create a proposal?

Arithmeticus commented 9 months ago

I would be willing to do so, but only if (1) a standard function would be significantly more performant than one written by users and (2) there are but dim prospects for development of an ecosystem that allows independent QT libraries to flourish. (See thread "packaging".)

My personal preference is that a community of linguists develop grapheme and related functions. But (to restate the points in the previous paragraph) I would reconsider that recommendation if performance would be suboptimal, or if an independent library of linguistic functions would lie in oblivion.

rhdunn commented 9 months ago

The issue isn't really whether the user can do this efficiently, but whether they can do it accurately. Especially when writing this in pure XPath/XQuery/XSLT.

The information needed for this is in the Unicode Character Database (UCD) and the algorithms specified in the relevant Unicode TRs.

Doing this properly would likely involve including an external library such as https://icu.unicode.org/. This is difficult to do outside of the processor, and implementors will already be including this data for other functions such as upper/lower case conversion and regex script/general category selectors (\p{Latn}, \p{Lu}, etc.). As such, this would be easier to have processor support for than doing it in an external component.

Arithmeticus commented 6 months ago

It is worth looking at efforts parallel to fn:graphemes in Rust and Python.

Unicode provides some excellent resources (links to UCD 15.1):

  1. Grapheme break tests
  2. Guide to grapheme breaks
  3. Grapheme break properties

Having looked more closely at the algorithm, I think that this cannot be easily implemented in iXML or regular expressions. I agree with @rhdunn that if fn:graphemes is to enter the QT ecosystem, it has to be done at the implementation level.

I think that the very extensive test suite provided by Unicode (no. 1 above) provides exactly what would be needed to ensure accurate implementation. I'd be willing to convert the Unicode test suite into QT4 tests.
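As a sketch of what such a conversion might start from, the following Python reads one line of the break test data and recovers the expected clusters. It assumes the line format used by GraphemeBreakTest.txt in the UCD: hexadecimal code points separated by ÷ (break) or × (no break), followed by a # comment. The function name parse_break_test_line is hypothetical, and generating actual QT4 test cases from the result is left out.

```python
BREAK = "\u00F7"     # the division sign marks a permitted break
NO_BREAK = "\u00D7"  # the multiplication sign marks a position with no break


def parse_break_test_line(line):
    """Return the expected grapheme clusters for one test line, or None if it has no data."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    clusters = []
    current = ""
    for token in data.split():
        if token == BREAK:
            if current:
                clusters.append(current)
                current = ""
        elif token == NO_BREAK:
            continue  # the next code point stays in the current cluster
        else:
            current += chr(int(token, 16))
    if current:
        clusters.append(current)
    return clusters


sample = "\u00F7 0061 \u00D7 0301 \u00F7 0062 \u00F7\t# illustrative line, not copied from the file"
print(parse_break_test_line(sample))
```

Each parsed line yields an input string (the clusters joined together) and the expected result sequence, which maps naturally onto an assertion about fn:graphemes.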

I think the more significant question is whether implementers of the QT 4.0 specs believe that their effort is worthwhile. There are several possible strategies an implementer could use to apply the rules.

Personally, I believe fn:grapheme has the potential to greatly help underserved communities. The communities that would benefit include those who use:

I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function.

gimsieke commented 6 months ago

Won’t the BreakIterators that ICU4J provides help implementers in the Java realm?

liamquin commented 6 months ago

On Mon, 2024-02-26 at 23:05 -0800, Gerrit Imsieke wrote:

Won’t the BreakIterators that ICU4J provides help implementers in the Java realm?

Yes. And in C# and C++. There's also code for identifying grapheme clusters in harfbuzz, usable from C and C++ directly.

There's some additional complexity in that e.g. SIL Graphite (e.g. in OpenOffice/LibreOffice) does shaping at the font level; the Unicode algorithm (I'm told) isn't adequate in all cases. If it becomes necessary I can find more details. But in practice the ICU BreakIterators or the harfbuzz hb_shape function seem to be what most people use. Either way, it likely adds a dependency.

I agree, however, it'd be a useful addition.


ChristianGruen commented 6 months ago

I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function.

We’d be willing to provide an implementation. The function can be compared with fn:parse-html: It’s too complex to provide a custom implementation, but if a library is available that does the actual work (ICU, in our case), it will be easy to embed it, and to enable the function if the library is found in the classpath.

michaelhkay commented 6 months ago

Similarly to Christian, if tests are available and an ICU library implementation is available, then it's not a major cost to add a function that wraps the ICU implementation.

rhdunn commented 6 months ago

Yes, I would expect this functionality to be implementable by wrapping the ICU functionality, or other Unicode library that implements the relevant TR logic. This is about exposing that capability to XPath, XSLT, and XQuery.

ChristianGruen commented 3 months ago

Accepted