koka-lang / koka

Koka language compiler and interpreter
http://koka-lang.org

std/text/unicode graphemes does not return a list of grapheme clusters to iterate #458

Open erf opened 9 months ago

erf commented 9 months ago

I thought graphemes("hi❤️‍🔥") would return the list ["h", "i", "❤️‍🔥"], that is, a list l of grapheme clusters that I could iterate with:

  l.foreach fn(c)
    println(c)

which would print out single grapheme clusters like:

h
i
❤️‍🔥

Also, printing l.length currently returns 6. I wish there were a function that returned the number of grapheme clusters, which would be 3 in this case.

I'm new to Koka and these libraries, so sorry if I've misunderstood the usage.

The Dart Characters package might serve as inspiration.

TimWhiting commented 9 months ago

See https://github.com/koka-lang/koka/blob/f48555f1e2ba211f3e86524a15668597cad19118/lib/std/text/unicode.kk#L56 for the details on how koka currently reports graphemes.

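Note that a Koka char is a Unicode code point, not a grapheme cluster, so doing string.list gives you characters at that granularity. That is why the example above reports 6 elements: h, i, the heart, a variation selector, a zero width joiner, and the fire.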

Here are some adjustments to the current implementation that I think give you what you want:

import std/text/unicode

// Join combining characters with their base into a grapheme.
fun join-combining( cs : list<char>, comb : list<char> = [], acc : list<grapheme> = []) : list<grapheme> {
  match(cs) {
    Cons(zwj, cc) | zwj.int == 0x200D -> // Add zero width joiner
      match cc
        Cons(c, cc') -> cc'.join-combining(Cons(c, Cons(zwj,comb)), acc)
        Nil -> cc.join-combining(Cons(zwj, comb), acc)
    Cons(c,cc) -> if (c.is-combining2)
                   then cc.join-combining( Cons(c,comb), acc )
                   else cc.join-combining( [c], consrev(comb,acc) )
    Nil        -> consrev(comb,acc).reverse
  }
}
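// Turn the pending code points (collected in reverse order) into one
// grapheme string and prepend it to the accumulator; skip if nothing is pending.
fun consrev(xs,xss) {
  if (xs.is-nil) then xss else Cons(xs.reverse.string,xss)
}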

pub fun is-combining2( c : char ) : bool {
  val i = c.int
  ((i >= 0x0300 && i <= 0x036F) ||
   (i >= 0x1AB0 && i <= 0x1AFF) ||
   (i >= 0x1DC0 && i <= 0x1DFF) ||
   (i >= 0x20D0 && i <= 0x20FF) ||
   (i >= 0xFE20 && i <= 0xFE2F) ||
   (i >= 0xFE00 && i <= 0xFE0F)) // Added variation selectors
}

fun main()
  "Code points".println
  "hi❤️‍🔥".list.map(show).join(",").println
  "NFC".println // This is the normalization that graphemes gives you
  "hi❤️‍🔥".normalize(NFC).list.join-combining.join(",").println
  "NFD".println
  "hi❤️‍🔥".normalize(NFD).list.join-combining.join(",").println
  "NFKC".println
  "hi❤️‍🔥".normalize(NFKC).list.join-combining.join(",").println
  "NFKD".println
  "hi❤️‍🔥".normalize(NFKD).list.join-combining.join(",").println

All of the different normalization schemes give the same result in this case. I added the zero width joiner to the join-combining function and added variation selectors to the is-combining2 function. I'll have to talk to Daan to see if this is the intended operation of graphemes.
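
With those adjustments, the iteration and the count asked for in the original post could look roughly like the sketch below. The names grapheme-clusters and grapheme-count are hypothetical helpers, not part of std/text/unicode, and the sketch assumes the adjusted join-combining and is-combining2 definitions from the snippet above are in the same file.

```koka
import std/text/unicode

// Hypothetical helper (not in std/text/unicode): split a string into
// grapheme clusters using the adjusted join-combining defined above.
fun grapheme-clusters( s : string ) : list<grapheme>
  s.normalize(NFC).list.join-combining

// Hypothetical helper: number of grapheme clusters in a string.
fun grapheme-count( s : string ) : int
  s.grapheme-clusters.length

fun main()
  val l = "hi❤️‍🔥".grapheme-clusters
  // Print one grapheme cluster per line.
  l.foreach fn(c)
    println(c)
  // Print the number of grapheme clusters.
  "hi❤️‍🔥".grapheme-count.println
```

Since grapheme is only an alias for string, each element can be passed straight to println. Whether a zero width joiner sequence should count as a single cluster still depends on the intended semantics of graphemes.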

From the API description copied below, it is not clear whether "self-contained symbol" means that the heart, the fire, and the variation selector should be kept separate. Since variation selectors and the zero width joiner have no visual representation of their own, it seems to me that the above changes should be incorporated. Either way, at minimum I think join-combining should be made a public function, with a variant that combines all code points that have no visual representation.

// Grapheme's are an alias for `:string`.
// Each grapheme is a self-contained symbol consisting of
// a unicode character followed by combining characters and/or
// combining marks.
pub alias grapheme = string