Open colinodell opened 4 years ago
Which would be a good way to keep emojis’ Unicode/image mapping?
I’m currently thinking of two possible ways, considering that we don’t want to hook up to Github for this:
Any other ideas? Maybe I could send a PR with my personal implementation (which uses config options ATM)
EDIT: Also, I think having Unicode characters mapped instead of images is better for maintainability.
Maybe have a third-party Composer package for the Unicode mappings. It could then be updated on its own schedule, from GitHub or whatever sources, and provide a clean API to access the mapping.
Also, I'm pretty sure it should be done as Unicode output. If someone wants to extend it to use images, they can do that on top of the conversion, probably just on the client side:
<span class="emoji" data-emoji-codepoint="1F60A" data-emoji-shortcode="blush">😊</span>
I think GitLab struggled with this at some point: they emit in HTML, and replace it with images only if the browser doesn't support colored emojis. This needs some digging in their issues or merge requests, or maybe they even wrote a blog post.
UPDATE (found GitLab docs):
Most emoji are natively supported on macOS, Windows, iOS, Android and will fallback to image-based emoji where there is lack of support.
Seems GitLab uses emojione:
which php variant is:
but that's superseded by:
Dumping some of my initial thoughts here:
- :+1:-type syntax create an Emoji inline element in the AST
- <span> examples. Not 100% sure on the image part, but if so, we should make it easy for people to plug in their own image set or provide a sane default.
- :+1: and 👍 end up with similar (if not identical) Emoji element representations
- :partyparrot:. How can we support that? What happens if the user chooses to render the Emoji elements as Unicode characters - what should happen to this emoji here?

For a default emoji code map, we could grab the shortcodes presented by the GitHub Emoji API and compile them down into a default config array. These emoji shortcodes are standard across Discord, GitHub, GitLab and Slack as far as I'm aware, and could be present in many more tools.
GitHub provides the codepoint sequence of each emoji in the URL of its images. We could ignore images that don't contain a /unicode/ string in their image path, so we ensure we don't mix in custom GitHub emoji images.
The image names in the GitHub API carry the hex value of the intended emoji in their path. For example, the URL for :grinning: is https://github.githubassets.com/images/icons/emoji/unicode/1f600.png?v8, which tells us that the hex code for this emoji is 1F600.
A more complex example is :family_man_woman_girl_boy:: https://github.githubassets.com/images/icons/emoji/unicode/1f468-1f469-1f467-1f466.png?v8, which presents the hex values 1F468, 1F469, 1F467 and 1F466.
Also, we should remember that complex emoji sequences require the use of Unicode's Zero Width Joiner (ZWJ), whose codepoint is U+200D. This is used to join individual emojis into a combined representation (as above, where we join a Man, Woman, Girl and Boy into a Family-specific representation).
Codepoint sequences can be dynamically created in PHP using the Multibyte String functions, which can convert ordinal (integer) Unicode values into PHP's character representation.
Currently, there are two functions in PHP 7+ that provide this functionality, but they may not be available by default, so the proper extensions may be needed:
- IntlChar::chr() function, which is present in ext-intl.
- mb_chr() function, which is present in ext-mbstring.

I have tested both functions in my private Emoji Extension, and they behave similarly regarding emoji rendering, so either one would be fine, but ext-mbstring is already required by this library, so we could go with that.
So the steps could be:
- hexdec for this.
- mb_chr().

Example code is available that renders a :family_man_woman_girl_boy: from a GitHub API image URL, showing code and result.
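A minimal sketch of these steps, assuming the URL shape shown earlier in this comment. The function names codepointsFromGitHubUrl() and emojiFromCodepoints() are hypothetical and exist only for illustration; only preg_match(), explode(), hexdec(), mb_chr() and implode() are real PHP functions. Whether a sequence needs ZWJ between its codepoints is left to the caller here, since not every multi-codepoint sequence uses it.

```php
<?php

/**
 * Hypothetical helper: extract the codepoints encoded in a GitHub emoji
 * image URL. Returns null when the path has no /unicode/ segment, which
 * is how custom (non-Unicode) images are skipped.
 *
 * @return int[]|null
 */
function codepointsFromGitHubUrl(string $url): ?array
{
    if (!preg_match('~/unicode/([0-9a-f]+(?:-[0-9a-f]+)*)\.png~i', $url, $m)) {
        return null;
    }

    // Each hyphen-separated chunk is one hexadecimal codepoint.
    return array_map('hexdec', explode('-', $m[1]));
}

/**
 * Hypothetical helper: turn integer codepoints into a UTF-8 string,
 * optionally joining them with the Zero Width Joiner (U+200D).
 *
 * @param int[] $codepoints
 */
function emojiFromCodepoints(array $codepoints, bool $joinWithZwj = false): string
{
    $chars = array_map(
        static function (int $codepoint): string {
            return mb_chr($codepoint, 'UTF-8');
        },
        $codepoints
    );

    return implode($joinWithZwj ? mb_chr(0x200D, 'UTF-8') : '', $chars);
}

$url = 'https://github.githubassets.com/images/icons/emoji/unicode/1f468-1f469-1f467-1f466.png?v8';
echo emojiFromCodepoints(codepointsFromGitHubUrl($url), true), PHP_EOL;
```

For the family URL the caller passes true so the four codepoints are joined with ZWJ; for a single-codepoint emoji such as :grinning: the default is enough.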
GFM doesn't allow emoji rendering in code blocks: :mexico: inside code will not render as an emoji, unlike :mexico: in regular text.

The current implementation of the library respects this, but I want to state it explicitly so that future implementations don't break it.
I recently discovered that there are two emoji representations:
- Text: U+FE0E
- Color (emoji): U+FE0F

Each of these codes is appended at the end of the code sequence. If multiple representation codes are detected, the first one detected is applied (Chrome and Firefox present this behavior). By default, emoji color fonts (such as Apple Emoji or Noto Color) use the color representation, and depending on the fonts available, they can switch between text and color. I assume we will prefer the color representation (otherwise, why would we want to have emoji?).
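As a rough illustration of preferring the color representation, the sketch below strips any trailing selector and appends U+FE0F. The function name withEmojiPresentation() is hypothetical, and the sketch assumes it is acceptable to append the selector to any emoji string passed in, even when that emoji is not a defined variation-sequence base.

```php
<?php

/**
 * Hypothetical helper: force the color (emoji) presentation by ensuring
 * the sequence ends with exactly one U+FE0F selector.
 */
function withEmojiPresentation(string $emoji): string
{
    // Drop any trailing text (U+FE0E) or emoji (U+FE0F) selectors first,
    // so the result never carries more than one selector at the end.
    $base = (string) preg_replace('/[\x{FE0E}\x{FE0F}]+$/u', '', $emoji);

    return $base . "\u{FE0F}";
}

// U+2764 (heavy black heart) defaults to text presentation in many fonts.
echo withEmojiPresentation("\u{2764}"), PHP_EOL;
```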
I love @glensc's proposal of using a <span> element to contain emojis. This way we could add special attributes such as data-emoji-codepoint for reference, class for custom styling and even alt or aria labels.
Another perk of using <span> elements is that, if a rendered emoji has a trailing ZWJ codepoint, consecutive emojis will not be merged; instead, they will be kept isolated from one another.
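A small sketch of what such a <span> wrapper could look like, reusing the attribute names from @glensc's example. The function name renderEmojiSpan() and the hyphen-separated codepoint format are assumptions made for illustration, not part of any existing API.

```php
<?php

/**
 * Hypothetical helper: wrap a UTF-8 emoji sequence in a <span> carrying
 * its codepoints and shortcode as data attributes.
 */
function renderEmojiSpan(string $emoji, string $shortcode): string
{
    // Split into individual Unicode characters, then format each one as
    // uppercase hex, e.g. "1F468-200D-1F469" for a ZWJ sequence.
    $codepoints = array_map(
        static function (string $char): string {
            return strtoupper(dechex(mb_ord($char, 'UTF-8')));
        },
        preg_split('//u', $emoji, -1, PREG_SPLIT_NO_EMPTY)
    );

    return sprintf(
        '<span class="emoji" data-emoji-codepoint="%s" data-emoji-shortcode="%s">%s</span>',
        implode('-', $codepoints),
        htmlspecialchars($shortcode, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($emoji, ENT_QUOTES, 'UTF-8')
    );
}

echo renderEmojiSpan("\u{1F60A}", 'blush'), PHP_EOL;
```

Called with 😊 and blush, this reproduces the markup from the earlier comment.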
<span> element container.

Thank you for that detailed analysis, @iksaku! And for pointing out some of the edge cases we need to be careful about. Overall, I think this is the right approach.
I think this can be slated for 2.0 (or later)
Upon doing lots of research around this topic, I ultimately came to the realization that there isn't that great of support for emojis in PHP at all.
There's great support for emojis on the JavaScript side of things (https://emojibase.dev), however not so much on the PHP side of things.
I was originally planning on using https://github.com/elvanto/litemoji, but it's static based and didn't really deal with the entire node API.
So I initially forked that project, but ultimately created a new one because by the end it was nothing like the original. I realized that what was needed was a proper wrapper of that node module (no sense in trying to reinvent the wheel here) and created the following PHP project: https://packagist.org/packages/unicorn-fail/emoji
It's essentially just a PHP based parser/converter API that utilizes the JS data from the node module into PHP objects that are serialized and gzipped (collections) for all the locales and presets variations.
While the code is fully tested and 100% covered, I haven't created a release yet because it doesn't (yet) cover all the use cases described above (i.e. use of images for output, custom emojis, etc.). I wasn't sure whether it should, or whether that could be something we add to that project as a feature later down the road.
For now, however, I think this new package lets us start getting closer to a proper extension for this project.
We could start by going with a class-based approach, in which we could define constant unicodes and map them to specific shortcuts, like Github's. This way, we could keep track of which codes are being used and which shortcuts are available and would serve as a starting point.
Upon implementing the Project emoji parser and renderer, developers could intercept the parsing phase and provide their custom emoji implementations if they don't want to use unicode, and instead they want to use Twemoji for example.
If developers want to extend their unicode list, they could inject custom phrases in the provided class-list, and let the default renderer implementation run as usual.
We could start by going with a class-based approach, in which we could define constant unicodes and map them to specific shortcuts, like Github's. This way, we could keep track of which codes are being used and which shortcuts are available and would serve as a starting point.
We started there but found it had some major limitations:
@markcarver is working on a much more robust implementation that will provide all of that and more :)
Upon implementing the Project emoji parser and renderer, developers could intercept the parsing phase and provide their custom emoji implementations if they don't want to use unicode, and instead they want to use Twemoji for example.
The actual "parsing" will not be done with a parser; rather, we'll look in the parsed AST for any Text elements. The contents of those will be run through an external library which will separate out the plain text from emoji-like things (emoji, emoticons, etc). We'll then replace the original Text node with one or more Text and/or Emoji nodes containing what was found.
The extension will also provide an inline renderer which will take those Emoji nodes and render them as needed.
This approach will support UTF-8-encoded emoji, emoji shortcodes, emoticons, and HTML entities.
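To make the "split a Text node into Text and Emoji pieces" idea concrete, here is a simplified sketch that only handles :shortcode: syntax and works on plain strings rather than real AST nodes. The function name splitTextIntoSegments(), the segment array shape and the shortcode map are all hypothetical; the external library mentioned above would do considerably more (raw emoji, emoticons, HTML entities).

```php
<?php

/**
 * Hypothetical helper: split text into "text" and "emoji" segments.
 *
 * @param array<string, string> $shortcodes Map of shortcode name => Unicode sequence
 * @return array<int, array{type: string, value: string}>
 */
function splitTextIntoSegments(string $text, array $shortcodes): array
{
    // Keep the :shortcode: delimiters in the result so they can be inspected.
    $parts = preg_split(
        '/(:[a-z0-9_+\-]+:)/i',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $segments = [];
    foreach ($parts as $part) {
        if (preg_match('/^:([a-z0-9_+\-]+):$/i', $part, $m) && isset($shortcodes[$m[1]])) {
            $segments[] = ['type' => 'emoji', 'value' => $shortcodes[$m[1]]];
            continue;
        }

        // Unknown shortcodes stay as plain text; merge with a preceding text segment.
        $last = count($segments) - 1;
        if ($last >= 0 && $segments[$last]['type'] === 'text') {
            $segments[$last]['value'] .= $part;
        } else {
            $segments[] = ['type' => 'text', 'value' => $part];
        }
    }

    return $segments;
}

print_r(splitTextIntoSegments('Nice :+1: work', ['+1' => "\u{1F44D}"]));
```

Each resulting segment would then map to one Text or Emoji node replacing the original Text node.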
I'm not entirely sure how custom libraries would be implemented at the moment, though the goal would certainly be to allow those somehow - I just don't know exactly how that would function just yet. I'd like to see how we're able to integrate @markcarver's functionality before making that determination. But rest assured we do want to make that possible :)
Thinking about this more, I believe we need the following features/components in our implementation:

We'll need two AST nodes:
- UnicodeEmoji node to represent emoji defined by the Unicode standard. It should store the corresponding grapheme cluster (and the shortcode, if known)
- CustomEmoji node to represent "custom emoji" not defined by Unicode - things like :octocat: or :shipit:. It should only store the shortcode.

- :shortcode: syntax, matching that against some user-provided list of known shortcodes (see "Providers") and inserting a UnicodeEmoji node into the AST if found (or a CustomEmoji node otherwise)
- Text and replaces any raw emoji sequences with UnicodeEmoji objects. Users may enable this if they want to send all emoji through a custom renderer - perhaps because they want to use a custom image set. (This would be disabled by default due to the performance overhead)
- Text that replaces emoticons like :-) with UnicodeEmoji
- UnicodeEmoji as the raw UTF-8 bytes for that emoji sequence (perhaps decorated in a <span> as mentioned earlier), and CustomEmoji as the original :shortcode: syntax
- UnicodeEmoji or CustomEmoji
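A bare-bones sketch of the two node types and the default rendering described in the list above. These are plain value classes for illustration only: they do not extend any real library base class, and the accessor names and the renderEmojiNode() function are hypothetical.

```php
<?php

// Illustration only: a real AST node would extend the library's inline node class.
final class UnicodeEmoji
{
    private $grapheme;
    private $shortcode;

    public function __construct(string $grapheme, ?string $shortcode = null)
    {
        $this->grapheme = $grapheme;
        $this->shortcode = $shortcode;
    }

    public function getGrapheme(): string
    {
        return $this->grapheme;
    }

    public function getShortcode(): ?string
    {
        return $this->shortcode;
    }
}

final class CustomEmoji
{
    private $shortcode;

    public function __construct(string $shortcode)
    {
        $this->shortcode = $shortcode;
    }

    public function getShortcode(): string
    {
        return $this->shortcode;
    }
}

/**
 * Hypothetical default renderer: Unicode emoji become their raw UTF-8
 * bytes, custom emoji fall back to the original :shortcode: syntax.
 */
function renderEmojiNode(object $node): string
{
    if ($node instanceof UnicodeEmoji) {
        return $node->getGrapheme();
    }

    if ($node instanceof CustomEmoji) {
        return ':' . $node->getShortcode() . ':';
    }

    throw new InvalidArgumentException('Unsupported node type: ' . get_class($node));
}

echo renderEmojiNode(new UnicodeEmoji("\u{1F44D}", '+1')), ' ';
echo renderEmojiNode(new CustomEmoji('octocat')), PHP_EOL;
```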
Because the list of shortcodes and their rendered representation is closely connected (especially for custom emoji), perhaps we'll have an EmojiProviderInterface which facilitates both the shortcode lookup and rendering. We could then provide a variety of default implementations like GitHubEmojiProvider or SlackEmojiProvider so users can easily choose their preferred flavor of shortcodes or implement their own for custom emoji.
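One possible shape for such a provider, purely as a sketch: the interface name comes from the paragraph above, but the lookup() and render() methods and the ArrayEmojiProvider class are assumptions made here for illustration and are not an agreed-upon design.

```php
<?php

interface EmojiProviderInterface
{
    /** Returns the Unicode sequence for a shortcode, or null if it is not known. */
    public function lookup(string $shortcode): ?string;

    /** Returns the output for a shortcode (hypothetical method). */
    public function render(string $shortcode): string;
}

/** Hypothetical provider backed by a plain shortcode => Unicode map. */
final class ArrayEmojiProvider implements EmojiProviderInterface
{
    private $map;

    /** @param array<string, string> $map */
    public function __construct(array $map)
    {
        $this->map = $map;
    }

    public function lookup(string $shortcode): ?string
    {
        return $this->map[$shortcode] ?? null;
    }

    public function render(string $shortcode): string
    {
        // Unknown shortcodes fall back to the original :shortcode: syntax.
        return $this->lookup($shortcode) ?? ':' . $shortcode . ':';
    }
}

$provider = new ArrayEmojiProvider(['+1' => "\u{1F44D}"]);
echo $provider->render('+1'), ' ', $provider->render('octocat'), PHP_EOL;
```

A GitHubEmojiProvider or SlackEmojiProvider could then differ only in the map it ships and in how it renders custom emoji.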
☝️ This is subject to change but I think it's a good starting point that provides decent separation of concerns.
Loving the AST part and the idea for the ability to use different emoji providers.
One thing I think would also be needed is a way to track when new short codes are added to each provider, say due to an Emoji spec update and providers slowly rolling out the new emojis. Thoughts?
One thing I think would also be needed is a way to track when new short codes are added to each provider, say due to an Emoji spec update and providers slowly rolling out the new emojis. Thoughts?
Part of me thinks that most users probably don't care about knowing when that emoji became available, and may not want or need version information exposed programmatically through code - simply providing notes in the CHANGELOG or docblocks might be enough for them.
In other words, we'd provide just enough functionality for users to "just use" emoji out-of-the-box, but also provide the ability for advanced users to supply their own provider (or write a simple adapter for another Packagist library) if they need different shortcodes or support for brand new emoji.
If you have a more advanced use-case in mind I'd love to hear about it! I'm definitely open to different ideas here :)
Oh, I meant having some way to “know” when Slack (e.g.) adds a new emoji with a shortcode, so that it gets added to the SlackEmojiProvider, and then just making it known in the changelogs 🙂
This is probably known already, but in the meantime Symfony added an Emoji transliterator which could be used to do this I think? https://symfony.com/doc/current/components/intl.html#emoji-transliteration
See https://github.com/thephpleague/commonmark-extras/issues/19