rivo / uniseg

Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go
MIT License

Emoji detection #27

Closed ivanjaros closed 1 year ago

ivanjaros commented 1 year ago

Since there is the emojiPresentation map, could this library be extended to detect emojis? I want to remove emojis from text, but for lack of alternatives it seems I have to use github.com/forPelevin/gomoji. It uses this library, but it also ships the entire emoji database as a 1.25 MB map that has to be loaded into memory, which I would like to avoid. Hence my question.

rivo commented 1 year ago

I suppose uniseg could help you do that. However, you would need to copy some code over to your own project, including the graphemeCodePoints and emojiPresentation tables (although graphemeCodePoints could be greatly reduced to only include the relevant emoji code points), because I'm not planning on making these internal functions and tables public.

You can take a look at FirstGraphemeClusterInString() and runeWidth(). These functions need to detect emojis to calculate a width of 2 for them. So this is what I would do:

  1. Use uniseg to break string into grapheme clusters.
  2. For each grapheme cluster, check the returned width. If width ≠ 2, it's not an emoji.
  3. Check all runes in grapheme cluster:
    1. If a rune is the "Variation Selector-16", it's an emoji.
    2. If the first rune is a regional indicator (i.e. country flags), it's an emoji.
    3. If the first rune is an extended pictographic, it may be an emoji. Check the emojiPresentation table. If it gives you the "emoji presentation" flag, it's an emoji.

This procedure considers ♫ not an emoji. If you want to eliminate these, too, then it's a bit different (and simpler, because you wouldn't need the emojiPresentation table or the check for the "Variation Selector-16", and emojis could have a width of 1).
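The simpler variant mentioned above could look roughly like the following sketch. It assumes the text has already been split into grapheme clusters (for example with uniseg) and only inspects the first rune of each cluster. The names isPictographicCluster, inRanges, and pictographicSample are hypothetical, and the table is a tiny hand-picked subset of the Extended_Pictographic property, not the full data, which would normally be generated from Unicode's emoji-data.txt.

```go
package main

// runeRange is an inclusive range of code points.
type runeRange struct{ lo, hi rune }

// pictographicSample is a deliberately small, illustrative subset of the
// Extended_Pictographic property (assumption: a real table would be
// generated from Unicode's emoji-data.txt and be much larger).
var pictographicSample = []runeRange{
	{0x2600, 0x2604},   // ☀..☄
	{0x2669, 0x267A},   // ♩..♺ (includes ♫)
	{0x1F300, 0x1F320}, // 🌀..🌠
	{0x1F600, 0x1F64F}, // Emoticons block
}

const (
	regionalIndicatorA = 0x1F1E6 // REGIONAL INDICATOR SYMBOL LETTER A
	regionalIndicatorZ = 0x1F1FF // REGIONAL INDICATOR SYMBOL LETTER Z
)

// inRanges reports whether r falls inside any of the given ranges.
func inRanges(r rune, ranges []runeRange) bool {
	for _, rr := range ranges {
		if r >= rr.lo && r <= rr.hi {
			return true
		}
	}
	return false
}

// isPictographicCluster is the relaxed check: it looks only at the first
// rune of a grapheme cluster and ignores both the cluster width and
// Variation Selector-16.
func isPictographicCluster(cluster []rune) bool {
	if len(cluster) == 0 {
		return false
	}
	first := cluster[0]
	if first >= regionalIndicatorA && first <= regionalIndicatorZ {
		return true // country flags are pairs of regional indicators
	}
	return inRanges(first, pictographicSample)
}
```

With this check, ♫ (U+266B) is treated as an emoji even though it has a width of 1, which is the behavioural difference from the stricter procedure.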

ivanjaros commented 1 year ago

thanks, i'll give it a try.

aymanbagabas commented 4 months ago

Hey @rivo, I've stumbled upon this, and I'm trying to detect emojis without copying any code from uniseg, using the function below. The only thing I'm missing is a check for the extended pictographic property.

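// see https://github.com/rivo/uniseg/issues/27
const (
    regionalIndicatorA  = 0x1F1E6 // REGIONAL INDICATOR SYMBOL LETTER A
    regionalIndicatorZ  = 0x1F1FF // REGIONAL INDICATOR SYMBOL LETTER Z
    variationSelector16 = 0xFE0F  // requests emoji presentation
)

// isEmojiCluster reports whether a grapheme cluster with display width w
// and the given runes is an emoji.
func isEmojiCluster(w int, runes []rune) bool {
    if w != 2 {
        return false
    }
    // Country flags start with a regional indicator.
    if len(runes) > 0 && runes[0] >= regionalIndicatorA && runes[0] <= regionalIndicatorZ {
        return true
    }
    // Range over the values, not the indices.
    for _, r := range runes {
        if r == variationSelector16 {
            return true
        }
    }
    // TODO: detect extended pictographic property
    return false
}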

Would you be OK with adding IsEmoji(width int, b []byte) bool and IsEmojiInString(width int, str string) bool to uniseg? I can send a PR for this.
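For illustration, the TODO in isEmojiCluster could be filled along the lines of the sketch below, using a small lookup table in place of uniseg's internal emojiPresentation table. This is an assumption-laden sketch, not uniseg's implementation: hasEmojiPresentation and emojiPresentationSample are hypothetical names, and the table lists only a few code point ranges that have the Emoji_Presentation property. Every entry in this sample is also Extended_Pictographic, so a single lookup stands in for both checks here; a complete table would come from Unicode's emoji-data.txt.

```go
package main

const (
	variationSelector16 = 0xFE0F  // VARIATION SELECTOR-16
	regionalIndicatorA  = 0x1F1E6 // REGIONAL INDICATOR SYMBOL LETTER A
	regionalIndicatorZ  = 0x1F1FF // REGIONAL INDICATOR SYMBOL LETTER Z
)

// emojiPresentationSample is an illustrative, incomplete list of inclusive
// code point ranges with Emoji_Presentation=Yes (assumption: real code
// would generate the full list from emoji-data.txt).
var emojiPresentationSample = [][2]rune{
	{0x231A, 0x231B},   // ⌚..⌛
	{0x1F300, 0x1F320}, // 🌀..🌠
	{0x1F600, 0x1F64F}, // Emoticons block
}

// hasEmojiPresentation reports whether r is covered by the sample table.
func hasEmojiPresentation(r rune) bool {
	for _, rg := range emojiPresentationSample {
		if r >= rg[0] && r <= rg[1] {
			return true
		}
	}
	return false
}

// isEmojiCluster follows the procedure from this thread: the cluster must
// have width 2, and it must either start with a regional indicator,
// contain Variation Selector-16, or start with a rune that has emoji
// presentation.
func isEmojiCluster(width int, runes []rune) bool {
	if width != 2 || len(runes) == 0 {
		return false
	}
	if runes[0] >= regionalIndicatorA && runes[0] <= regionalIndicatorZ {
		return true
	}
	for _, r := range runes {
		if r == variationSelector16 {
			return true
		}
	}
	return hasEmojiPresentation(runes[0])
}
```

The width and the runes would come from iterating over grapheme clusters with uniseg. Wide non-emoji clusters such as CJK ideographs have a width of 2 as well, which is why the width check alone is not sufficient.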

mikelorant commented 4 months ago

This would be a great addition, and I hope it will be considered for merging.