Open erf opened 9 months ago
See https://github.com/koka-lang/koka/blob/f48555f1e2ba211f3e86524a15668597cad19118/lib/std/text/unicode.kk#L56 for the details on how koka currently reports graphemes.
Note that strings are already utf16, and characters are utf16 code points, so doing string.list gives you characters at that granularity.
Here are some adjustments to the current implementation that I think give you what you want:
import std/text/unicode

// Join combining characters with their base into a grapheme.
fun join-combining( cs : list<char>, comb : list<char> = [], acc : list<grapheme> = []) : list<grapheme> {
  match(cs) {
    Cons(zwj, cc) | zwj.int == 0x200D -> // Add zero width joiner
      match cc
        Cons(c, cc') -> cc'.join-combining(Cons(c, Cons(zwj,comb)), acc)
        Nil -> cc.join-combining(Cons(zwj, comb), acc)
    Cons(c,cc) -> if (c.is-combining2)
                   then cc.join-combining( Cons(c,comb), acc )
                   else cc.join-combining( [c], consrev(comb,acc) )
    Nil -> consrev(comb,acc).reverse
  }
}

fun consrev(xs,xss) {
  if (xs.is-nil) then xss else Cons(xs.reverse.string,xss)
}

pub fun is-combining2( c : char ) : bool {
  val i = c.int
  ((i >= 0x0300 && i <= 0x036F) ||
   (i >= 0x1AB0 && i <= 0x1AFF) ||
   (i >= 0x1DC0 && i <= 0x1DFF) ||
   (i >= 0x20D0 && i <= 0x20FF) ||
   (i >= 0xFE20 && i <= 0xFE2F) ||
   (i >= 0xFE00 && i <= 0xFE0F)) // Added variation selectors
}

fun main()
  "Utf16 code points".println
  "hi❤️🔥".list.map(show).join(",").println
  "NFC".println // This is the normalization that graphemes gives you
  "hi❤️🔥".normalize(NFC).list.join-combining.join(",").println
  "NFD".println
  "hi❤️🔥".normalize(NFD).list.join-combining.join(",").println
  "NFKC".println
  "hi❤️🔥".normalize(NFKC).list.join-combining.join(",").println
  "NFKD".println
  "hi❤️🔥".normalize(NFKD).list.join-combining.join(",").println
All of the different normalization schemes give the same result in this case. I added the zero width joiner to the join-combining function and added variation selectors to the is-combining2 function. I'll have to talk to Daan to see if this is the intended operation of graphemes.
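For anyone who wants to try the joining rule without a Koka toolchain, here is a rough Python transliteration of the adjusted join-combining logic. It is only a sketch: the names `join_combining` and `is_combining` are made up for this example and are not part of std/text/unicode, the input string spells out the zero width joiner explicitly as an assumption about what the emoji contains, and the rule is far simpler than full Unicode grapheme segmentation (UAX #29).

```python
ZWJ = 0x200D  # zero width joiner

# Same code point ranges as is-combining2, including the variation selectors.
COMBINING_RANGES = [
    (0x0300, 0x036F),
    (0x1AB0, 0x1AFF),
    (0x1DC0, 0x1DFF),
    (0x20D0, 0x20FF),
    (0xFE20, 0xFE2F),
    (0xFE00, 0xFE0F),
]


def is_combining(ch):
    """True if the code point falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COMBINING_RANGES)


def join_combining(text):
    """Group code points into clusters: base + combining marks + ZWJ sequences."""
    clusters = []
    current = ""
    i = 0
    while i < len(text):
        ch = text[i]
        if ord(ch) == ZWJ:
            # Glue the joiner and the code point after it (if any) onto the cluster.
            current += text[i:i + 2]
            i += 2
        elif is_combining(ch):
            current += ch
            i += 1
        else:
            # A non-combining code point starts a new cluster.
            if current:
                clusters.append(current)
            current = ch
            i += 1
    if current:
        clusters.append(current)
    return clusters


# h, i, heart, variation selector 16, zero width joiner, fire
text = "hi\u2764\ufe0f\u200d\U0001F525"
clusters = join_combining(text)
print(len(text))      # 6 code points
print(len(clusters))  # 3 clusters
```

Counting the clusters is then just the length of the returned list, which is the number the original question was after.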
From the API description copied below, it is not clear whether "self-contained symbol" means keeping the heart / fire and variation selector separate or not. It seems to me that since the variation selectors and zero width joiner do not have any character representation of their own, the above changes should be incorporated. Either way, at minimum I think join-combining should be made a public function, with a variant that combines all (visually) non-representable code points.
// Grapheme's are an alias for `:string`.
// Each grapheme is a self-contained symbol consisting of
// a unicode character followed by combining characters and/or
// combining marks.
pub alias grapheme = string
I thought graphemes("hi❤️🔥") would return the list ["h", "i", "❤️🔥"], a list of grapheme clusters that I could iterate with:
which would print out single grapheme clusters like:
Also, if I print l.length it now returns 6. I wish there was a function which would return the number of grapheme clusters, like 3 in this case.
I'm new to koka and these libraries, so sorry if I've mistaken the usage.
This Dart Characters package might be inspiration.