mabe02 / lanterna

Java library for creating text-based GUIs
GNU Lesser General Public License v3.0

Feature request: emoji support #505

Open atoktoto opened 3 years ago

atoktoto commented 3 years ago

The code textGraphics.putString(0, 3, "🍕") results in two question mark characters being displayed in the terminal. At the same time, System.out.println("🍕") works as intended (at least in terminal emulators that support emoji, e.g. the new Windows Terminal).

Is it possible to support this use case? I guess the emoji code points do not fit into the char type that is used in Terminal.putCharacter, so this would require major changes.
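To illustrate the guess above: a Java char is a single 16-bit UTF-16 code unit, so a code point outside the Basic Multilingual Plane, such as 🍕 (U+1F355), is stored as a surrogate pair of two chars. A minimal sketch (the class name EmojiCharDemo and its helpers are made up for illustration and are not part of Lanterna):

```java
public final class EmojiCharDemo {
    // U+1F355 SLICE OF PIZZA, built from its code point to avoid source-encoding issues
    static final String PIZZA = new String(Character.toChars(0x1F355));

    // Number of UTF-16 code units (Java chars) in the string
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + charCount(PIZZA));
        System.out.println("code points: " + codePointCount(PIZZA));
        // Each half of the pair is a surrogate and is meaningless on its own
        System.out.println("first char is high surrogate: "
                + Character.isHighSurrogate(PIZZA.charAt(0)));
        System.out.println("second char is low surrogate: "
                + Character.isLowSurrogate(PIZZA.charAt(1)));
    }
}
```

Writing the two chars to a terminal one at a time, each as its own character, would plausibly explain the two question marks.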

rednoah commented 3 years ago

Assuming that you're talking about Windows, printing Emoji to the terminal already works via the WriteConsole native call in WindowsConsoleOutputStream. But only the new Windows Terminal can actually display Emoji by default. CMD and PowerShell will just display a box.

That being said, full support for grapheme clusters (i.e. user-perceived characters) would be appreciated, not only for Emoji but for all the non-Latin languages where a series of code points coalesce into a single-width user-perceived character.

rednoah commented 3 years ago

com.ibm.icu.text.BreakIterator can be used to iterate grapheme clusters (i.e. single-width user-perceived characters) though this will require icu4j as an additional dependency.

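import com.ibm.icu.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Splits a string into grapheme clusters (user-perceived characters)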
public static List<String> getGraphemeClusters(String self) {
    List<String> characters = new ArrayList<String>(self.length());
    BreakIterator i = BreakIterator.getCharacterInstance();
    i.setText(self);
    for (int begin = 0, end = 0; (end = i.next()) != BreakIterator.DONE; begin = i.current()) {
        characters.add(self.substring(begin, end));
    }
    return characters;
}

The JDK built-in java.text.BreakIterator may or may not work well depending on the specific use case. It'll work for Asian languages (e.g. บุฟเฟต์) but won't work for Emoji sequences (e.g. 👩‍👩‍👦‍👦).

atoktoto commented 3 years ago

Certainly seems doable, but it would require a significant change (or addition) to the Lanterna interfaces. Currently, char and TextCharacter seem to be at the center of the whole operation. Replacing them with a String representing a single grapheme cluster seems wasteful (in terms of memory) for the general case and can make the API less legible: void putCharacter(String s) looks wrong :D

Also, Windows Terminal only displays emoji correctly if running a WSL session.

mabe02 commented 3 years ago

I don't see why emoji wouldn't work with the current system, given that we can do CJK characters just fine. I'll investigate, maybe it's the terminal encoding that needs to be updated.

rednoah commented 3 years ago

Here's what I get with lanterna 3.0.3 for a file name such as THAIบุฟเฟต์EMOJI👩‍👩‍👦‍👦.txt:

[Screenshot: Screen Shot 2020-08-30 at 10 02 12]

That being said, neither iTerm nor Terminal render this particular Emoji correctly either:

$ ls
THAIบุฟเฟต์EMOJI👩?👩?👦?👦.txt

rednoah commented 3 years ago

EDIT: บุฟเฟต์ does render correctly, but the layout does not account for the compound characters บุ and ต์ each taking up only 1 column (even though each is 2 code points), so the layout is off by 2 here: [Screenshot: Screen Shot 2020-08-30 at 10 10 32]

$ ls
TEST.mp4
บุฟเฟต์.mp4

CJK works because those are 1 code point per character, i.e. a CJK character is 1 code point, but บุ is 2 code points, which are composed into a single logical character by the text renderer.
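The difference can be checked directly in Java. The sketch below compares บุ (U+0E1A followed by U+0E38) with a sample CJK character (中, U+4E2D, picked here only as an example) and shows that the second code point of บุ is classified as a non-spacing mark. The class and method names are made up for illustration:

```java
public final class CodePointDemo {
    // Thai BO BAIMAI followed by the combining vowel SARA U
    static final String THAI_BO_U = "\u0E1A\u0E38";
    // A single CJK ideograph, chosen only as an example
    static final String CJK = "\u4E2D";

    // Number of Unicode code points in the string
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the last code point is a non-spacing (combining) mark
    static boolean endsWithNonSpacingMark(String s) {
        int last = s.codePointBefore(s.length());
        return Character.getType(last) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        System.out.println("Thai code points: " + codePoints(THAI_BO_U));
        System.out.println("CJK code points:  " + codePoints(CJK));
        System.out.println("Thai ends with combining mark: "
                + endsWithNonSpacingMark(THAI_BO_U));
        System.out.println("CJK ends with combining mark:  "
                + endsWithNonSpacingMark(CJK));
    }
}
```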

mabe02 commented 3 years ago

Interesting, so the CJK detector incorrectly flags บุ as two text characters wide?

mabe02 commented 3 years ago

Ok, I see the problem now. Java char type isn't able to store emoji: https://developers.redhat.com/blog/2019/08/16/manipulating-emojis-in-java-or-what-is-%F0%9F%90%BB-1/ Slightly unexpected. Will see what we can do about this.

rednoah commented 3 years ago

Yes, บุ is 2 code points. It even requires hitting DELETE twice to delete the entire character. Hitting DELETE once only changes บุ to บ. Kinda like NFD, except there is no NFC form for บุ.

mabe02 commented 3 years ago

The problem I'm finding is that even "บุ".length() returns 2... I'm trying to change the internal representation of TextCharacter to String, but it's tricky to know whether a character should be considered single- or double-width, given that Java provides little guidance. I'd like to avoid hard-coding Unicode page references if possible...

mabe02 commented 3 years ago

Have been browsing articles and it really seems like while we can get the number of code points, there's no way to know if these code points are combined into a single character, or if that character is double or single width!
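The JDK does not expose a display width, but it does expose general categories and scripts, which allow a rough guess. The sketch below is only a heuristic under stated assumptions: combining marks and format characters count as 0 columns, Han, Hiragana, Katakana and Hangul count as 2, and everything else counts as 1. It is not what Lanterna does, it ignores emoji entirely, it is wrong for cases such as halfwidth Katakana, and a real terminal may disagree with it. All names are hypothetical:

```java
public final class WidthHeuristic {
    // Rough column estimate for one code point (heuristic, not authoritative)
    static int columnsOfCodePoint(int codePoint) {
        int type = Character.getType(codePoint);
        if (type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.FORMAT) {
            return 0; // drawn on top of the previous character, or invisible
        }
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
                return 2; // assumed double-width
            default:
                return 1;
        }
    }

    // Sum of the per-code-point estimates
    static int columnsOfString(String s) {
        return s.codePoints().map(WidthHeuristic::columnsOfCodePoint).sum();
    }

    public static void main(String[] args) {
        String thai = "\u0E1A\u0E38\u0E1F\u0E40\u0E1F\u0E15\u0E4C"; // บุฟเฟต์
        System.out.println("chars:   " + thai.length());
        System.out.println("columns: " + columnsOfString(thai));
    }
}
```

For บุฟเฟต์ the char count and the estimated column count differ by 2, which matches the layout offset reported above.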

rednoah commented 3 years ago

Yep, pretty much. lanterna effectively can't predict how a terminal window is going to render the text, because it depends on the version of Unicode used by the text renderer. Though we can generally assume that long-established Unicode sequences like บุ will work just fine, while recent additions like 👩‍👩‍👦‍👦 are likely not to work.

You can use the java.text.BreakIterator to split a String into "display characters" like so:

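import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Splits a string into grapheme clusters ("display characters")
public static List<String> getGraphemeClusters(String self) {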
    List<String> characters = new ArrayList<String>(self.length());
    BreakIterator i = BreakIterator.getCharacterInstance();
    i.setText(self);
    for (int begin = 0, end = 0; (end = i.next()) != BreakIterator.DONE; begin = i.current()) {
        characters.add(self.substring(begin, end));
    }
    return characters;
}

java.text.BreakIterator and com.ibm.icu.text.BreakIterator can be used interchangeably here. java.text.BreakIterator has the advantage of being a JDK built-in class. com.ibm.icu.text.BreakIterator has the advantage of working better for recent Unicode additions (i.e. complex compound emoji sequences, which your terminal window probably won't display correctly anyway).

It might make sense to make the BreakIterator configurable:
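One possible shape for this, as a minimal sketch only: the splitter takes a factory for the iterator, so callers can decide which BreakIterator to use. GraphemeSplitter and its methods are hypothetical names, not Lanterna API. The sketch is typed against java.text.BreakIterator; since the ICU class is a separate type, plugging it in would need a small adapter or a more general interface.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public final class GraphemeSplitter {
    // Creates a fresh iterator per call, since BreakIterator instances are stateful
    private final Supplier<BreakIterator> iteratorFactory;

    GraphemeSplitter(Supplier<BreakIterator> iteratorFactory) {
        this.iteratorFactory = iteratorFactory;
    }

    // Default configuration: the JDK's character-boundary iterator
    static GraphemeSplitter jdkDefault() {
        return new GraphemeSplitter(BreakIterator::getCharacterInstance);
    }

    // Splits text at the boundaries reported by the configured iterator
    List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        BreakIterator it = iteratorFactory.get();
        it.setText(text);
        int begin = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; begin = end, end = it.next()) {
            clusters.add(text.substring(begin, end));
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(jdkDefault().split("abc"));
    }
}
```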

mabe02 commented 3 years ago

Ok, so here's what we'll do. In 3.0 we'll restrict TextCharacter to BMP only, with an override if you really know what you're doing. In 3.1, also restrict but change to use String internally and let you supply your own "String" character for complicated emoji. Will try this out.

mabe02 commented 3 years ago

Ok, I misunderstood the BMP again. For now I've just blocked 3.0 from creating TextCharacters from surrogate chars. Next, in 3.1, I'll use the BreakIterator above to try to group characters.
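For readers wondering what such a block might look like, here is a hedged sketch of a guard that rejects lone surrogate chars. It is not the actual Lanterna change; SurrogateGuard and requireNonSurrogate are made-up names, and the only real API used is Character.isSurrogate:

```java
public final class SurrogateGuard {
    // Returns c unchanged, or throws if c is half of a UTF-16 surrogate pair
    static char requireNonSurrogate(char c) {
        if (Character.isSurrogate(c)) {
            throw new IllegalArgumentException(
                    "Cannot build a character from a lone surrogate: U+"
                            + Integer.toHexString(c).toUpperCase());
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(requireNonSurrogate('A'));
        try {
            // First char of the pair that encodes U+1F4A3 (BOMB)
            requireNonSurrogate('\uD83D');
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```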

mabe02 commented 3 years ago

Okay, I've re-worked TextCharacter to support this: PR for review: https://github.com/mabe02/lanterna/pull/508

mabe02 commented 3 years ago

Ok, code is merged. If you clone and build release/3.1 (I'll do another release in a week or so) you should be able to print emoji as double-width and your magic บุ character only occupying one column. Please try it out and report back before I close this.

MVoloshin commented 2 years ago

@mabe02, I can't print the BOMB character "\uD83D\uDCA3" (💣) using Lanterna 3.2.0-master on Windows 7 x64 (SwingTerminalWindow). I just get two rectangles :(