keianrao / EmoJiPicker

A quickly put together emoji picker in Java Swing
GNU General Public License v3.0

[Discussion] Saving emoji sequences #1

Closed keianrao closed 4 years ago

keianrao commented 4 years ago

I've read some of Unicode Technical Standard #51, which explains the terms used in the data. Also, looking at the data files I've collected, I think the only one we will need is emoji-test, as it lists the semantic groups as well as the code points for all of their members.
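As a rough sketch of what reading emoji-test could look like: the class and method names below are hypothetical, and the line layout (`# group: ...` headers, and data lines of space-separated hex code points followed by `;`, a status, and a `#` comment) is my understanding of the file, so check it against the copy you have.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the project: reads single lines in the emoji-test layout. */
final class EmojiTestLineSketch {

    /** Returns the group name if the line is a "# group: ..." header, otherwise null. */
    static String parseGroupName(String line) {
        String prefix = "# group:";
        return line.startsWith(prefix) ? line.substring(prefix.length()).trim() : null;
    }

    /** Returns the code points listed before the ';' on a data line, or null for any other line. */
    static int[] parseCodePoints(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) return null;

        int semicolon = trimmed.indexOf(';');
        if (semicolon < 0) return null;

        String fields = trimmed.substring(0, semicolon).trim();
        if (fields.isEmpty()) return null;

        // Each field is a hexadecimal code point. An int is wide enough for all of them.
        String[] hexFields = fields.split("\\s+");
        int[] codePoints = new int[hexFields.length];
        for (int i = 0; i < hexFields.length; i++) {
            codePoints[i] = Integer.parseInt(hexFields[i], 16);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // Sample lines written by hand in the assumed layout, not copied from the real file.
        System.out.println(parseGroupName("# group: Smileys & Emotion"));
        System.out.println(Arrays.toString(
                parseCodePoints("1F44D 1F3FD ; fully-qualified # thumbs up: medium skin tone")));
    }
}
```

The code points are kept as plain `int` values here, so nothing is lost even though they do not fit in a `char`.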

The problem is that Java's char is a 16-bit UTF-16 code unit and only goes up to 0xFFFF, but emoji-test lists most of the emoji with code points above that.
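To make the limit concrete, here is a small sketch using U+1F600 as an example code point. The class and method names are made up for illustration. The `java.lang.Character` calls are standard ones.

```java
/** Hypothetical demo class showing why one char cannot hold most emoji. */
final class CharLimitSketch {

    /** True if the code point is encoded as a single UTF-16 code unit. */
    static boolean fitsInOneChar(int codePoint) {
        return Character.charCount(codePoint) == 1;
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // example code point above 0xFFFF

        // The largest value a char can hold.
        System.out.println(Integer.toHexString(Character.MAX_VALUE)); // ffff

        // A narrowing cast silently drops the high bits, giving a different character.
        char truncated = (char) grinningFace;
        System.out.println(Integer.toHexString(truncated)); // f600

        // Code points above 0xFFFF are "supplementary" and take two chars in UTF-16.
        System.out.println(Character.isSupplementaryCodePoint(grinningFace)); // true
        System.out.println(Character.charCount(grinningFace)); // 2
    }
}
```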

This article by Red Hat mentions that lang.Character should have utility methods to help, and that if that fails I should look for "surrogate characters".

(One can also cheat by grabbing the emoji from the start of the comment on each line, but comments shouldn't be relied on, and I think that will fail for emoji modifiers, or if we try to add emoji modifiers ourselves.)

keianrao commented 4 years ago

I already have several changes planned, but right now I am going to look at lang.Character and also read briefly about surrogate characters.

Architecture-wise, there is nothing of concern yet: the data we are going to read should map cleanly to the typical emoji picker GUI.

keianrao commented 4 years ago

"The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)"

keianrao commented 4 years ago

Alright. So, if we want to show the emoji in a Swing button, we need to provide it as a String, which is a CharSequence, which is in UTF-16.

As advertised, lang.Character does provide utility methods for converting those code points we see in emoji-test to UTF-16 surrogate pairs. Once we've assembled all the UTF-16 code units for the emoji, we can get a String out of them using lang.String#valueOf.
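A minimal sketch of that conversion, assuming the code points of one emoji have already been parsed into an `int[]`. The class and method names are hypothetical.

```java
/** Hypothetical helper: turns the code points of one emoji sequence into a Java String. */
final class EmojiStringSketch {

    static String toEmojiString(int[] codePoints) {
        // Worst case: every code point needs a surrogate pair, i.e. two chars.
        char[] units = new char[codePoints.length * 2];
        int length = 0;
        for (int codePoint : codePoints) {
            // Writes one or two chars starting at index `length` and returns how many it wrote.
            // Throws IllegalArgumentException if the value is not a valid code point.
            length += Character.toChars(codePoint, units, length);
        }
        return String.valueOf(units, 0, length);
    }

    public static void main(String[] args) {
        // U+1F44D followed by the modifier U+1F3FD, as one sequence.
        String thumbsUp = toEmojiString(new int[] {0x1F44D, 0x1F3FD});
        System.out.println(thumbsUp.length());                              // 4
        System.out.println(thumbsUp.codePointCount(0, thumbsUp.length()));  // 2
    }
}
```

The resulting String can be passed to a Swing button as usual. Whether the glyph actually renders depends on the font in use. The `String(int[] codePoints, int offset, int count)` constructor does the same conversion in one call, if assembling the char array by hand turns out to be unnecessary.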

lang.Character puts forward a rather funny hack: the helper methods take what are basically UCS-4 values, using the int primitive type, which is 32 bits.
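For reference, a short sketch of how those int-based helpers go back and forth between a code point and its surrogate pair. The class and method names are hypothetical, and `Character.highSurrogate` and `Character.lowSurrogate` need Java 7 or later.

```java
/** Hypothetical demo of the int-based code point helpers in java.lang.Character. */
final class SurrogateRoundTripSketch {

    /** Splits a supplementary code point (above 0xFFFF) into its high and low surrogates. */
    static char[] split(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        return new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }

    /** Joins a surrogate pair back into the int code point it encodes. */
    static int join(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] pair = split(0x1F600);
        System.out.println(Integer.toHexString(pair[0]));                // d83d
        System.out.println(Integer.toHexString(pair[1]));                // de00
        System.out.println(Integer.toHexString(join(pair[0], pair[1]))); // 1f600
    }
}
```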

keianrao commented 4 years ago

I think we will go ahead with that ("translating to Java's native char & string"), because the alternative would be to roll our own solution for keeping those code points. I have no ideas off the top of my head for how to do so, besides bit arrays, so it's probably not a good route.