haskell / text-icu

This package provides the Haskell Data.Text.ICU library, for performing complex manipulation of Unicode text.
BSD 2-Clause "Simplified" License

Getting the size of a grapheme cluster #97

Open Kleidukos opened 10 months ago

Kleidukos commented 10 months ago

I'd like to get the size of a grapheme cluster (from a value of type Text). Is there a function in the library that can help me with it? If not, is it in the scope of the library to provide one?

vshabanov commented 9 months ago

I'm not even sure what the "size of a grapheme cluster" means.

There are various ways to normalize text (compose/decompose grapheme clusters): https://hackage.haskell.org/package/text-icu-0.8.0.3/docs/Data-Text-ICU-Normalize2.html

Maybe unorm2_composePair() can help to compose those clusters and get their size.

Kleidukos commented 9 months ago

> I'm not even sure what the "size of a grapheme cluster" means.

It's the operation that gives the length in graphemes, not code points. For example, the length of this grapheme cluster: "🤦🏼‍♂️" is 1.

This is an interesting problem; there's a short read about it here: https://tonsky.me/blog/unicode/

andreasabel commented 9 months ago

@Kleidukos In Agda we use cluster counting as linked below. Is that what you are looking for?

https://github.com/agda/agda/blob/4c5501e369b63ff3eabdbb3217db59904baf0e78/src/full/Agda/Interaction/Highlighting/LaTeX/Base.hs#L708-L716

    length . ICU.breaks (ICU.breakCharacter ICU.Root)
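For anyone landing here later, the one-liner above can be fleshed out into a small standalone program. This is only a sketch: the names `graphemeLength` and `main` are made up for illustration and are not part of text-icu, and it assumes the `text` and `text-icu` packages (plus the ICU C libraries) are installed. The emoji is written with escapes so the source stays ASCII.

```haskell
module Main (main) where

import           Data.Text     (Text)
import qualified Data.Text     as T
import qualified Data.Text.ICU as ICU

-- | Hypothetical helper (not provided by text-icu): count grapheme
-- clusters by splitting the text at ICU character boundaries under
-- the root locale and counting the resulting pieces.
graphemeLength :: Text -> Int
graphemeLength = length . ICU.breaks (ICU.breakCharacter ICU.Root)

main :: IO ()
main = do
  -- The facepalm emoji from this thread, as five code points:
  -- U+1F926 U+1F3FC U+200D U+2642 U+FE0F
  let facepalm = T.pack "\x1F926\x1F3FC\x200D\x2642\xFE0F"
      -- 'e' followed by a combining acute accent (U+0301)
      accented = T.pack "e\x0301"
  print (T.length facepalm)        -- code points, not graphemes
  print (graphemeLength facepalm)  -- grapheme clusters, as seen by ICU
  print (T.length accented)
  print (graphemeLength accented)
```

`T.length` counts code points, so it reports 5 for the emoji and 2 for the accented letter. `graphemeLength` should report 1 for the accented letter; for the emoji the result depends on the Unicode data shipped with the installed ICU, and a reasonably recent ICU is expected to treat the whole ZWJ sequence as a single cluster.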

Kleidukos commented 9 months ago

Oh yeah definitely! I'm quite surprised it's not offered by the library directly. Thanks @andreasabel!