Emojis are encoded in Unicode, just like most other characters, so every emoji has a Unicode value
that can be used to refer to it. For example, the laughing crying face (FACE WITH TEARS OF JOY) has
the value u'\U0001F602', and printing that string in Python displays the emoji.
These values are well documented, and the differences we see in emojis across platforms stem not
from differences in Unicode values, but from platform-specific ways of rendering these symbols.
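As a minimal sketch of the point above, assuming Python 3 and a terminal whose font and encoding can show emojis, the snippet below prints the laughing crying face from its Unicode escape and asks the standard `unicodedata` module for its documented name. The variable name `face` is only illustrative.

```python
import unicodedata

# U+1F602 is the Unicode value this document uses for the laughing crying face.
face = u'\U0001F602'

print(face)                    # shows the emoji where the terminal supports it
print(hex(ord(face)))          # 0x1f602
print(unicodedata.name(face))  # FACE WITH TEARS OF JOY
```

The code point and name are the same on every platform; only the glyph drawn for `print(face)` differs.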
The Problem
Some emojis in the given .htm files have been converted from their usual Unicode representation to
python src encodings. The laughing crying face above has a src value of "\U000fe334", a code point
in a Unicode private use area, which cannot easily be used interchangeably with the standard emoji.
There are two dictionaries in the file emoji_values.py: one maps python src encodings to the
colloquial names of their corresponding emojis, and the other maps those names to Unicode values.
For the sample emoji used thus far:
>>> emoji_values.SRC_CODES_TO_CAP_NAME[u'\U000fe334']
'FACE WITH TEARS OF JOY'
>>> emoji_values.EMOJI_UNICODE['face_with_tears_of_joy']
u'\U0001F602'
# Note: the first dictionary uses capitalized names with spaces while the second uses lowercase
# names with underscores. This difference is not important; it is simply how I found these
# libraries online. There are methods in `emojis.py` that convert between the two forms.
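The conversion between the two name forms mentioned in the note can be sketched as follows. These two helpers are hypothetical stand-ins, not the actual methods in `emojis.py`, and they assume names made only of letters and spaces (or underscores); the second one also strips surrounding colons in case the keys carry them.

```python
def cap_name_to_key(cap_name):
    """Hypothetical helper: 'FACE WITH TEARS OF JOY' -> 'face_with_tears_of_joy'."""
    return cap_name.lower().replace(u' ', u'_')


def key_to_cap_name(key):
    """Hypothetical helper: 'face_with_tears_of_joy' -> 'FACE WITH TEARS OF JOY'."""
    # Strip surrounding colons first, in case the key is written ':blue_heart:'.
    return key.strip(u':').replace(u'_', u' ').upper()


print(cap_name_to_key(u'FACE WITH TEARS OF JOY'))  # face_with_tears_of_joy
print(key_to_cap_name(u'face_with_tears_of_joy'))  # FACE WITH TEARS OF JOY
```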
The issue is that while the second dictionary, EMOJI_UNICODE, is complete, the src codes dictionary
is missing many emojis. As a result, a python src encoding is sometimes encountered that is not in
the dictionary and thus cannot be converted to Unicode.
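To illustrate the failure mode, here is a small sketch of the two-step lookup. The dictionaries are tiny stand-ins for the real ones in emoji_values.py, the keys are assumed to be written without colons, and `src_to_unicode` is a hypothetical function name, not something taken from the project.

```python
# Tiny stand-ins for the dictionaries in emoji_values.py (assumed shapes).
SRC_CODES_TO_CAP_NAME = {u'\U000fe334': u'FACE WITH TEARS OF JOY'}
EMOJI_UNICODE = {
    u'face_with_tears_of_joy': u'\U0001F602',
    u'blue_heart': u'\U0001F499',
}


def src_to_unicode(src_code):
    """Return the Unicode emoji for a src encoding, or None if it is unknown."""
    cap_name = SRC_CODES_TO_CAP_NAME.get(src_code)
    if cap_name is None:
        return None  # the src codes dictionary has no entry for this code
    key = cap_name.lower().replace(u' ', u'_')
    return EMOJI_UNICODE.get(key)


print(src_to_unicode(u'\U000fe334') == u'\U0001F602')  # True
print(src_to_unicode(u'\U000feb13'))                   # None
```

The second call returns None even though EMOJI_UNICODE knows the blue heart, because the src code itself is missing from the first dictionary.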
The Solution
Since the Unicode dictionary is more complete than the src encodings dictionary, we can use the
Unicode dictionary to look up equivalent src encodings. The solution is essentially to go through
every element in the EMOJI_UNICODE dictionary that doesn't have an equivalent in the src encodings
dictionary, and add an appropriate element. For example:
If we encounter a key and value from EMOJI_UNICODE without a corresponding src encoding key/value
pair, say u':blue_heart:': u'\U0001F499', we can look up the Unicode value u'\U0001F499' and try
to find a page that lists src encoding values for this emoji. Such a page contains the information
we need: at the bottom of its encodings table, we can see that the src encoding is u'\U000FEB13'
(to find this, try using Ctrl+F and searching for the code).
To finish up this emoji, we add this information to the src codes dictionary; specifically, the
line u'\U000feb13': u'BLUE HEART' would suffice. This maps the code u'\U000FEB13' to the string
"BLUE HEART", which can then be mapped to our existing Unicode value.
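The blue heart example can be checked with a short sketch. The dictionaries are again small stand-ins for the real ones, with keys assumed to be written without colons; if the real EMOJI_UNICODE keys carry colons, the key construction would need adjusting.

```python
# Small stand-ins for the dictionaries in emoji_values.py (assumed shapes).
SRC_CODES_TO_CAP_NAME = {u'\U000fe334': u'FACE WITH TEARS OF JOY'}
EMOJI_UNICODE = {
    u'face_with_tears_of_joy': u'\U0001F602',
    u'blue_heart': u'\U0001F499',
}

# The new entry: src code for the blue heart -> its capitalized name.
SRC_CODES_TO_CAP_NAME[u'\U000feb13'] = u'BLUE HEART'

# The src code can now be followed all the way through to the Unicode value.
cap_name = SRC_CODES_TO_CAP_NAME[u'\U000feb13']
key = cap_name.lower().replace(u' ', u'_')
print(key)                                 # blue_heart
print(EMOJI_UNICODE[key] == u'\U0001F499')  # True
```

Python treats u'\U000FEB13' and u'\U000feb13' as the same string, so the case of the hex digits in the escape does not matter.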
The above should be done for every element in the EMOJI_UNICODE dictionary that is not in
SRC_CODES_TO_CAP_NAMES.
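One way to get the list of emojis that still need a src encoding is sketched below. `missing_src_names` is a hypothetical helper and the sample dictionaries are stand-ins; the function only reports which names are missing, since the src codes themselves still have to be looked up by hand.

```python
def missing_src_names(src_codes_to_cap_name, emoji_unicode):
    """Return the EMOJI_UNICODE names that have no src encoding entry yet."""
    # Normalize the capitalized names to the lowercase, underscored form.
    known = {name.lower().replace(u' ', u'_')
             for name in src_codes_to_cap_name.values()}
    # Strip colons so both 'blue_heart' and ':blue_heart:' style keys work.
    return sorted(key.strip(u':') for key in emoji_unicode
                  if key.strip(u':') not in known)


# Stand-in data: only the laughing crying face has a src encoding so far.
src_codes = {u'\U000fe334': u'FACE WITH TEARS OF JOY'}
emoji_unicode = {
    u'face_with_tears_of_joy': u'\U0001F602',
    u':blue_heart:': u'\U0001F499',
}

print(missing_src_names(src_codes, emoji_unicode))  # ['blue_heart']
```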