unicode-rs / unicode-segmentation

Grapheme Cluster and Word boundaries according to UAX#29 rules
https://unicode-rs.github.io/unicode-segmentation

Please add Tailored grapheme clusters/more support for complex scripts like Devanagari and other Indic scripts #111

Closed mah1212 closed 1 year ago

mah1212 commented 1 year ago

I am new to Rust, and I am trying to split Devanagari text into vowels and whole conjunct consonants (two, three, or four consonants joined together) while keeping each vowel sign and virama attached. Later I want to map the resulting units to another Indic script.

I have used grapheme clusters in my current code, but it does not give me the desired output.

Here's what I wrote to split:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let hs = "हिन्दी मुख्यमंत्री हिमंत";
    // `true` selects extended grapheme clusters, as defined by UAX #29.
    for g in hs.graphemes(true) {
        print!("{}  ", g); // double space for eye comfort
    }
}

Current output:
हि न् दी मु ख् य मं त् री हि मं त

Desired output:
हि न्दी मु ख्य मं त्री हि मं त

I have posted this on Stack Overflow in more detail.

Unicode® Standard Annex #29 describes "tailored grapheme clusters", which seem similar to the problem above, if not exactly the same.

Is there any way to achieve this using the current implementation of unicode-segmentation? If not, it would be great to have this functionality for complex scripts such as the Indic scripts.

Manishearth commented 1 year ago

Hmm, so while I've been hoping for akshara support in UAX 29 for a while, it's an extremely nuanced topic. It has been proposed multiple times before and has run into many challenges (here's a past spec proposal).

This crate is an attempt to implement the UAX 29 specification, and tailoring is out of scope (since tailoring is not specified).

The right place for this would be ICU4X. However, ICU4X consumes CLDR data, and CLDR does not yet have data for tailoring aksharas. There's a proposal here, but it seems to have stalled (I can ask around to see if there's been progress).

You can use custom data with ICU4X to introduce akshara breaking by adding a virama-breaking rule to grapheme.toml and regenerating data. This is fiddly and I would not recommend it right now.

One of the main challenges is that different Indic scripts have different attitudes toward this: not all are happy with conjuncts, and many default to not having conjuncts except in some specific cases. This means it's hard for Unicode to come up with a single solution here, and the current tendency is to ask users to figure out what they need and do the tailoring themselves.

In your case, you can build an akshara-handling tailored segmenter on top of this segmenter by wrapping it in an iterator that catches segments ending in a virama and glues each one to the next segment, provided that segment starts with a letter.
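A minimal sketch of that gluing step, not code from the crate. It makes several assumptions: the names `glue_aksharas` and `DEVANAGARI_VIRAMA` are hypothetical, only the Devanagari virama (U+094D) is recognised, "starts with a letter" is approximated with `char::is_alphabetic`, and the result is collected into owned `String`s rather than returned as a lazy iterator. It accepts any sequence of `&str` clusters, so it does not depend on a particular segmenter.

```rust
/// Devanagari virama (halant). Assumption: other Indic scripts would need
/// their own virama code points added here.
const DEVANAGARI_VIRAMA: char = '\u{094D}';

/// Hypothetical helper: merges each cluster that ends in a virama with the
/// cluster that follows it, as long as that following cluster starts with a
/// letter. Everything else is passed through unchanged.
fn glue_aksharas<'a, I>(clusters: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut out: Vec<String> = Vec::new();
    // True when the last segment in `out` ends in a virama and may still
    // absorb the next cluster.
    let mut pending = false;

    for cluster in clusters {
        let starts_with_letter = cluster
            .chars()
            .next()
            .map_or(false, char::is_alphabetic);

        if pending && starts_with_letter {
            // `pending` is only set after a push, so `out` is non-empty here.
            out.last_mut()
                .expect("pending implies a previous segment")
                .push_str(cluster);
        } else {
            out.push(cluster.to_string());
        }

        // Chains such as "स्" + "त्" + "री" keep gluing because the merged
        // segment still ends in a virama after the second cluster is added.
        pending = cluster.ends_with(DEVANAGARI_VIRAMA);
    }
    out
}
```

With the code from the first comment, the call would be `glue_aksharas(hs.graphemes(true))`. A virama followed by a space or punctuation is left alone, because the next cluster does not start with a letter.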

mah1212 commented 1 year ago

Thank you for the detailed explanation on the challenges of Akshara support in UAX 29. I appreciate your suggestions.