Feature Request: Emoji Parser

colinodell commented 4 years ago

See https://github.com/thephpleague/commonmark-extras/issues/19

iksaku commented 4 years ago

What would be a good way to store the emoji Unicode/image mapping?

I’m currently thinking of two possible ways, considering that we don’t want to hook up to GitHub for this:

- Having the map hard-coded in class constants
- Having it available in config options for merge/override (I kinda prefer this one)

Any other ideas? Maybe I could send a PR with my personal implementation (which uses config options ATM)

EDIT: Also, I think having Unicode characters mapped instead of images is better for maintainability.

glensc commented 4 years ago

Maybe have some third-party Composer package for the Unicode mappings. It could then be updated on its own schedule, from GitHub or whatever sources, and provide a clean API to access the mapping.

Also, I'm pretty sure it should be done as Unicode output. If someone wants to extend it to use images, they can do that on top of the conversion, probably just on the client side:

<span class="emoji" data-emoji-codepoint="1F60A" data-emoji-shortcode="blush">&#x1F60A;</span>

I think GitLab struggled with this at some point: they emit the emoji in HTML and replace it with images only if the browser doesn't support colored emojis. This needs some digging in their issues or merge requests, or maybe they even wrote a blog post.

UPDATE (found GitLab docs):

https://docs.gitlab.com/ee/user/markdown.html#emoji

Most emoji are natively supported on macOS, Windows, iOS, Android and will fallback to image-based emoji where there is lack of support.
https://docs.gitlab.com/ee/development/fe_guide/emojis.html

glensc commented 4 years ago

Seems GitLab uses emojione:

https://github.com/bonusly/gemojione

whose PHP variant is:

https://packagist.org/packages/emojione/emojione

but that's superseded by:

https://github.com/joypixels/emoji-toolkit

colinodell commented 4 years ago

Dumping some of my initial thoughts here:

- The parser should look for both Unicode characters and :+1:-type syntax, and create an Emoji inline element in the AST.
- The renderer will be responsible for determining whether to render those as images, Unicode characters, or something like @glensc's <span> examples. Not 100% sure on the image part, but if so, we should make it easy for people to plug in their own image set or provide a sane default.
- We need the ability to map back and forth between Unicode characters and emoji names so that both :+1: and 👍 end up with similar (if not identical) Emoji element representations.
- What about support for custom emojis? For example, I might want :partyparrot:. How can we support that? If the user chooses to render the Emoji elements as Unicode characters, what should happen to this emoji?
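To make the back-and-forth mapping concrete, here is a minimal sketch of a two-way lookup. The class name `EmojiMap` and its method names are hypothetical, not part of this library; the sketch assumes the first name registered for an emoji is treated as its canonical name, so that :+1: and 👍 resolve to the same pair.

```php
<?php
/**
 * Hypothetical two-way lookup between shortcode names and Unicode emoji.
 */
final class EmojiMap
{
    /** @var array<string, string> name => emoji */
    private array $byName = [];

    /** @var array<string, string> emoji => canonical name */
    private array $byEmoji = [];

    /** @param array<string, string> $map name => emoji */
    public function __construct(array $map)
    {
        foreach ($map as $name => $emoji) {
            $this->byName[$name] = $emoji;
            // Assumption: the first name seen for an emoji is its canonical name.
            // The cast matters because PHP turns numeric keys such as "100" into ints.
            $this->byEmoji[$emoji] ??= (string) $name;
        }
    }

    public function emojiFor(string $name): ?string
    {
        return $this->byName[$name] ?? null;
    }

    public function nameFor(string $emoji): ?string
    {
        return $this->byEmoji[$emoji] ?? null;
    }
}

$map = new EmojiMap(['+1' => "\u{1F44D}", 'thumbsup' => "\u{1F44D}"]);
```

With this shape, a custom emoji such as :partyparrot: simply has no entry, which leaves the renderer free to decide what to do with it.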

iksaku commented 4 years ago

About Initial Short Code availability

For a default emoji code map, we could grab the short codes exposed by the GitHub Emoji API and compile them into a default config array. As far as I'm aware, these emoji short codes are standard across Discord, GitHub, GitLab and Slack, and could be present in many more tools.

GitHub provides the codepoint sequence of each emoji in the URL of its image. We could ignore images whose path doesn't contain /unicode/, so that we don't mix in custom GitHub emoji images.

The file name in each GitHub API image path is the hex value of the intended emoji. For example, the URL for :grinning: is https://github.githubassets.com/images/icons/emoji/unicode/1f600.png?v8, which tells us that the hex code for this emoji is 1F600.

A more complex example is :family_man_woman_girl_boy:: https://github.githubassets.com/images/icons/emoji/unicode/1f468-1f469-1f467-1f466.png?v8, which presents the hex values 1F468, 1F469, 1F467 and 1F466.

Also, we should remember that complex emoji sequences require Unicode's Zero Width Joiner (ZWJ), whose codepoint is U+200D. It joins individual emojis into a combined representation (as above, where Man, Woman, Girl and Boy are joined into a Family representation).

Regarding Emoji Rendering

Codepoint sequences can be created dynamically in PHP using multibyte-aware functions that convert ordinal (integer) Unicode values into PHP strings. Currently, there are two functions in PHP 7+ that do this, but they may not be available by default, so the proper extensions may be needed:

- IntlChar::chr(), which is provided by ext-intl.
- mb_chr(), which is provided by ext-mbstring.

I have tested both functions in my private Emoji Extension, and they behave similarly regarding emoji rendering, so either one would be fine. Since ext-mbstring is already required by this library, we could go with that.

So the steps could be:

1. Extract the hex value from GitHub's image URL string and explode it into hex string parts.
2. Convert each part from hex to decimal representation; we can use hexdec for this.
3. Convert each value into a multibyte string using mb_chr().
4. If the sequence has more than one codepoint, join the codepoints with a ZWJ between them (not after the last one).
5. Voilà! The emoji string is available.

Here is example code that renders :family_man_woman_girl_boy: from a GitHub API image URL, showing code and result.

Full code for future reference

```php
$zwj = "\u{200D}";

// Reference, may come from 'foreach' loop
$short_code = ':family_man_woman_girl_boy:';
$url = 'https://github.githubassets.com/images/icons/emoji/unicode/1f468-1f469-1f467-1f466.png?v8';

if (preg_match('/^((?!unicode).)*$/', $url)) {
    // Custom Github emoji
    return;
}

preg_match('/(?<=\/)[a-zA-Z0-9\-]+(?=\.png)/', $url, $matches);

$parts = array_map(
    fn (string $part) => mb_chr(hexdec($part)),
    explode('-', $matches[0])
);

$unicode = implode($zwj, $parts);

echo $unicode;
```
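The snippet above handles a single URL. As a rough sketch of how it might extend to a whole shortcode => URL array (the shape described for the GitHub Emoji API), the two functions below are hypothetical helpers and the octocat URL is illustrative sample data. Note that not every multi-codepoint emoji is a ZWJ sequence (flag emoji, for instance, are pairs of regional indicator symbols with no joiner), so the joiner is kept as a parameter rather than hard-coded.

```php
<?php
/**
 * Convert a GitHub-style image URL into the emoji string it depicts,
 * or return null for custom (non-Unicode) images.
 */
function emojiFromUrl(string $url, string $joiner = "\u{200D}"): ?string
{
    // Only URLs under /unicode/ encode codepoints in the file name.
    if (!preg_match('~/unicode/([0-9a-f]+(?:-[0-9a-f]+)*)\.png~i', $url, $m)) {
        return null;
    }

    $chars = array_map(
        static fn (string $hex): string => mb_chr((int) hexdec($hex), 'UTF-8'),
        explode('-', $m[1])
    );

    // Join between codepoints only, so no trailing joiner is produced.
    return implode($joiner, $chars);
}

/**
 * @param array<string, string> $response shortcode name => image URL
 * @return array<string, string> ":name:" => emoji string
 */
function buildShortcodeMap(array $response): array
{
    $map = [];
    foreach ($response as $name => $url) {
        $emoji = emojiFromUrl($url);
        if ($emoji !== null) {
            $map[':' . $name . ':'] = $emoji;
        }
    }

    return $map;
}

// Sample data shaped like the API response described above.
$map = buildShortcodeMap([
    'grinning' => 'https://github.githubassets.com/images/icons/emoji/unicode/1f600.png?v8',
    'family_man_woman_girl_boy' => 'https://github.githubassets.com/images/icons/emoji/unicode/1f468-1f469-1f467-1f466.png?v8',
    'octocat' => 'https://github.githubassets.com/images/icons/emoji/octocat.png?v8',
]);
```

Here the custom octocat entry is skipped, and the family entry becomes four codepoints joined by three ZWJs.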

Extra Notes

- GFM doesn't allow emoji rendering in code blocks:
  - If in a code block, :mexico: will not render as an emoji; it stays as `:mexico:`.

  The current implementation of the library already respects this, but I'm spelling it out so that future changes don't break it.
- I recently discovered that there are two emoji representations:
  - Text representation, which appends U+FE0E
  - Color representation, which appends U+FE0F

  One of these codes is appended at the end of the code sequence. If multiple representation codes are detected, the first one detected is applied (Chrome and Firefox present this behavior). By default, emoji color fonts (such as Apple Color Emoji or Noto Color Emoji) use color representation, and depending on the fonts available, they can switch between text and color. I assume we will prefer color representation (otherwise, why would we want emoji?).
- I love @glensc's proposal of using a <span> element to contain emojis. This way we could add special attributes such as data-emoji-codepoint for reference, class for custom styling, and even alt or aria labels.

Another perk of using <span> elements is that, in the case that a rendered emoji has a trailing ZWJ codepoint, consecutive emojis will not be mixed; instead, they will be kept isolated from one another.
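A minimal sketch of such a container, assuming color presentation is forced by appending U+FE0F whenever no variation selector is already present. The function name `renderEmojiSpan` is hypothetical; the class and data attributes follow @glensc's example.

```php
<?php
/**
 * Render an emoji string inside a <span>, preferring color presentation.
 * Hypothetical helper; not part of the library.
 */
function renderEmojiSpan(string $emoji, string $shortcode): string
{
    // Drop any trailing ZWJ so neighbouring emoji cannot fuse.
    $emoji = preg_replace('/\x{200D}+$/u', '', $emoji) ?? $emoji;

    // Append the color-presentation selector unless one is already there.
    if (!preg_match('/[\x{FE0E}\x{FE0F}]$/u', $emoji)) {
        $emoji .= "\u{FE0F}";
    }

    // Hex codepoints joined by "-", e.g. "1F60A-FE0F", for the data attribute.
    $codepoints = implode('-', array_map(
        static fn (string $ch): string => strtoupper(dechex(mb_ord($ch, 'UTF-8'))),
        mb_str_split($emoji, 1, 'UTF-8')
    ));

    return sprintf(
        '<span class="emoji" data-emoji-codepoint="%s" data-emoji-shortcode="%s">%s</span>',
        $codepoints,
        htmlspecialchars($shortcode, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($emoji, ENT_QUOTES, 'UTF-8')
    );
}

$html = renderEmojiSpan("\u{1F60A}", 'blush');
```

Whether U+FE0F should be appended unconditionally is a judgment call; a text-presentation selector chosen by the author is left untouched here.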

Edit

Fixed example link and code to prevent trailing ZWJ codepoints. Elaborated a bit more on using a <span> element container.

colinodell commented 4 years ago

Thank you for that detailed analysis, @iksaku! And for pointing out some of the edge cases we need to be careful about. Overall, I think this is the right approach.

markhalliwell commented 4 years ago

I think this can be slated for 2.0 (or later)

markhalliwell commented 3 years ago

Upon doing lots of research around this topic, I ultimately came to the realization that there isn't that great of support for emojis in PHP at all.

There's great support for emojis on the JavaScript side of things (https://emojibase.dev), however not so much on the PHP side of things.

I was originally planning on using https://github.com/elvanto/litemoji, but it's static based and didn't really deal with the entire node API.

So I initially forked that project, but ultimately created a new one because by the end it no longer resembled the original at all. I realized that what was needed was a proper wrapper of that node module (no sense in trying to reinvent the wheel here) and created the following PHP project: https://packagist.org/packages/unicorn-fail/emoji

It's essentially just a PHP-based parser/converter API that converts the JS data from the node module into PHP objects that are serialized and gzipped (collections) for all the locales and preset variations.

While the code is fully tested and 100% covered, I haven't created a release yet because it doesn't (yet) cover all the use cases described above (i.e. use of images for output, custom emojis, etc.). I wasn't sure if it should or that could be something we work on in that project as a feature later down the road.

For now, though, I think this project gets us closer to creating a proper emoji extension for this library.

iksaku commented 3 years ago

We could start by going with a class-based approach, in which we could define constant unicodes and map them to specific shortcuts, like Github's. This way, we could keep track of which codes are being used and which shortcuts are available and would serve as a starting point.

Upon implementing the Project emoji parser and renderer, developers could intercept the parsing phase and provide their custom emoji implementations if they don't want to use unicode, and instead they want to use Twemoji for example.

If developers want to extend their unicode list, they could inject custom phrases in the provided class-list, and let the default renderer implementation run as usual.

colinodell commented 3 years ago

> We could start by going with a class-based approach, in which we could define constant unicodes and map them to specific shortcuts, like Github's. This way, we could keep track of which codes are being used and which shortcuts are available and would serve as a starting point.

We started there but found it had some major limitations:

- It makes GitHub authoritative on emojis, when they're not
- It wouldn't support emoticons or HTML entities
- It wouldn't provide any additional metadata that might be useful, like skin tone variations or alternate aliases

@markcarver is working on a much more robust implementation that will provide all of that and more :)

> Upon implementing the Project emoji parser and renderer, developers could intercept the parsing phase and provide their custom emoji implementations if they don't want to use unicode, and instead they want to use Twemoji for example.

The actual "parsing" will not be done with a parser; rather, we'll look in the parsed AST for any Text elements. The contents of those will be run through an external library which will separate out the plain text from emoji-like things (emoji, emoticons, etc). We'll then replace the original Text node with one or more Text and/or Emoji nodes containing what was found.

The extension will also provide an inline renderer which will take those Emoji nodes and render them as needed.

This approach will support UTF-8-encoded emoji, emoji shortcodes, emoticons, and HTML entities.
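As a rough illustration of the Text-node replacement step, the sketch below splits a plain string into text and emoji segments. It covers only :shortcode: syntax (not raw emoji, emoticons, or HTML entities), and it uses plain arrays where the real extension would create Text and Emoji nodes. The function name and the map format are assumptions.

```php
<?php
/**
 * Split text into ['text', string] and ['emoji', string] segments.
 *
 * @param array<string, string> $shortcodes name => emoji (hypothetical map)
 * @return array<int, array{0: string, 1: string}>
 */
function splitEmoji(string $text, array $shortcodes): array
{
    // The capture group keeps each :candidate: token in the result.
    $parts = preg_split(
        '/(:[a-z0-9_+\-]+:)/i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $segments = [];
    foreach ($parts as $part) {
        $isCandidate = preg_match('/^:([a-z0-9_+\-]+):$/i', $part, $m) === 1;
        if ($isCandidate && isset($shortcodes[$m[1]])) {
            $segments[] = ['emoji', $shortcodes[$m[1]]];
            continue;
        }

        // Unknown candidates (e.g. ":30:" in a time) stay as text and are
        // merged into the preceding text segment, if there is one.
        $last = count($segments) - 1;
        if ($last >= 0 && $segments[$last][0] === 'text') {
            $segments[$last][1] .= $part;
        } else {
            $segments[] = ['text', $part];
        }
    }

    return $segments;
}

$segments = splitEmoji('Ship it :+1: at 10:30:45', ['+1' => "\u{1F44D}"]);
```

The sample call yields a text segment, an emoji segment, and a final text segment in which the time is left intact.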

I'm not entirely sure how custom libraries would be implemented at the moment, though the goal would certainly be to allow those somehow - I just don't know exactly how that would function just yet. I'd like to see how we're able to integrate @markcarver's functionality before making that determination. But rest assured we do want to make that possible :)

colinodell commented 1 year ago

Thinking about this more, I believe we need the following features/components in our implementation:

AST Nodes

We'll need two AST nodes:

- A UnicodeEmoji node to represent emoji defined by the Unicode standard. It should store the corresponding grapheme cluster (and the shortcode, if known)
- A CustomEmoji node to represent "custom emoji" not defined by Unicode - things like :octocat: or :shipit:. It should only store the shortcode.

Parsers

- An inline parser for the :shortcode: syntax, matching that against some user-provided list of known shortcodes (see "Providers") and inserting a UnicodeEmoji node into the AST if found (or a CustomEmoji node otherwise)
- An optional processor that iterates parsed Text and replaces any raw emoji sequences with UnicodeEmoji objects. Users may enable this if they want to send all emoji through a custom renderer - perhaps because they want to use a custom image set. (This would be disabled by default due to the performance overhead)
- Another optional processor for Text that replaces emoticons like :-) with UnicodeEmoji

Renderers

- A default renderer which renders UnicodeEmoji as the raw UTF-8 bytes for that emoji sequence (perhaps wrapped in a <span> as mentioned earlier), and CustomEmoji as the original :shortcode: syntax
- The ability for users to provide their own custom renderer which can output custom HTML for UnicodeEmoji or CustomEmoji

Providers

Because the list of shortcodes and their rendered representation is closely connected (especially for custom emoji), perhaps we'll have an EmojiProviderInterface which facilitates both the shortcode lookup and rendering. We could then provide a variety of default implementations like GitHubEmojiProvider or SlackEmojiProvider so users can easily choose their preferred flavor of shortcodes or implement their own for custom emoji.

☝️ This is subject to change but I think it's a good starting point that provides decent separation of concerns.
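To illustrate how these pieces might fit together, here is a minimal sketch. Only the names EmojiProviderInterface, UnicodeEmoji and CustomEmoji come from the proposal above; the `lookup()` method, `ArrayEmojiProvider`, and the two free functions are assumptions, and the node classes are plain stand-ins rather than real AST nodes from the library.

```php
<?php
// Interface name comes from the proposal; its single method is an assumption.
interface EmojiProviderInterface
{
    /** Return the emoji for a shortcode name (without colons), or null if unknown. */
    public function lookup(string $shortcode): ?string;
}

// Hypothetical provider backed by a plain array.
final class ArrayEmojiProvider implements EmojiProviderInterface
{
    /** @var array<string, string> */
    private array $map;

    public function __construct(array $map)
    {
        $this->map = $map;
    }

    public function lookup(string $shortcode): ?string
    {
        return $this->map[$shortcode] ?? null;
    }
}

// Plain stand-ins for the two proposed AST nodes.
final class UnicodeEmoji
{
    public string $emoji;
    public ?string $shortcode;

    public function __construct(string $emoji, ?string $shortcode = null)
    {
        $this->emoji = $emoji;
        $this->shortcode = $shortcode;
    }
}

final class CustomEmoji
{
    public string $shortcode;

    public function __construct(string $shortcode)
    {
        $this->shortcode = $shortcode;
    }
}

/** Known shortcodes become UnicodeEmoji; anything else becomes CustomEmoji. */
function parseShortcode(string $shortcode, EmojiProviderInterface $provider): object
{
    $emoji = $provider->lookup($shortcode);

    return $emoji !== null
        ? new UnicodeEmoji($emoji, $shortcode)
        : new CustomEmoji($shortcode);
}

/** Default rendering: raw UTF-8 for Unicode emoji, :shortcode: for custom ones. */
function renderDefault(object $node): string
{
    if ($node instanceof UnicodeEmoji) {
        return $node->emoji;
    }
    if ($node instanceof CustomEmoji) {
        return ':' . $node->shortcode . ':';
    }

    throw new InvalidArgumentException('Not an emoji node');
}

$provider = new ArrayEmojiProvider(['+1' => "\u{1F44D}"]);
```

A GitHubEmojiProvider or SlackEmojiProvider would then differ only in the map it ships, and a custom renderer could replace `renderDefault` to emit images instead.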

iksaku commented 1 year ago

Loving the AST part and the idea for the ability to use different emoji providers.

One thing I think would also be needed is a way to track when new short codes are added to each provider, say due to an Emoji spec update and providers slowly rolling out the new emojis. Thoughts?

colinodell commented 1 year ago

> One thing I think would also be needed is a way to track when new short codes are added to each provider, say due to an Emoji spec update and providers slowly rolling out the new emojis. Thoughts?

Part of me thinks that most users probably don't care about knowing when that emoji became available, and may not want or need version information exposed programmatically through code - simply providing notes in the CHANGELOG or docblocks might be enough for them.

In other words, we'd provide just enough functionality for users to "just use" emoji out-of-the-box, but also provide the ability for advanced users to supply their own provider (or write a simple adapter for another Packagist library) if they need different shortcodes or support for brand new emoji.

If you have a more advanced use-case in mind I'd love to hear about it! I'm definitely open to different ideas here :)

iksaku commented 1 year ago

Oh, I meant having some way to “know” when Slack (e.g.) adds a new emoji with a short code, so that it is added to the SlackEmojiProvider, and then just making it known in the changelogs 🙂

dkarlovi commented 6 months ago

This is probably known already, but in the meantime Symfony added an Emoji transliterator which could be used to do this I think? https://symfony.com/doc/current/components/intl.html#emoji-transliteration

thephpleague / commonmark