seanlobo / fb-msg-mining


Scraping emoji codes from online #4

Open seanlobo opened 8 years ago

seanlobo commented 8 years ago

Emojis

Summary

Emojis are encoded in Unicode, just like most other characters. Every emoji has a Unicode value that can be used to refer to it. For example, the following code works:

>>> print(u'\U0001F602')
😂
>>> u'\U0001F602' == '😂'
True

These values are well documented. The differences we see in emojis across platforms stem not from different Unicode values but from platform-specific ways of rendering these symbols.
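As a quick illustration of the point above, the standard-library `unicodedata` module reports the same code point and official name for an emoji regardless of how a platform draws it. This is only a sketch for checking values by hand and is not part of the repository:

```python
import unicodedata

emoji = '\U0001F602'

# The code point and its official Unicode name are the same everywhere;
# only the glyph drawn for it differs between platforms.
code_point = ord(emoji)
name = unicodedata.name(emoji)

print(hex(code_point))  # 0x1f602
print(name)             # FACE WITH TEARS OF JOY
```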

The Problem

Some emojis in the .htm files given are converted from their usual Unicode representation to Python src encodings. The laughing-crying face above has a src value of "\U000fe334", which cannot easily be used interchangeably with the corresponding emoji. There are two dictionaries in the file emoji_values.py: one that maps Python src encodings to the colloquial names of their corresponding emojis, and one that maps those names to Unicode values. For the emoji sample used thus far:

>>> print(emoji_values.SRC_CODES_TO_CAP_NAME[u'\U000fe334'])
FACE WITH TEARS OF JOY
>>> print(emoji_values.EMOJI_UNICODE['face_with_tears_of_joy'])
😂
# Note: the printed emoji is u'\U0001F602'. The first dictionary uses capitalized names with
# spaces while the second uses lowercase names with underscores. This difference is not
# important; it is just the way I found these libraries online. There are methods in
# `emojis.py` that convert between the two forms.

The issue is that while the second dictionary, EMOJI_UNICODE, is complete, the SRC_CODES dictionary is missing many emojis. As a result, a Python src encoding is sometimes encountered that is not in the dictionary and thus cannot be converted to Unicode.
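The two-step lookup and its failure case can be sketched as follows. The dictionaries here are one-entry stand-ins for the ones in emoji_values.py, and `cap_name_to_key` and `src_to_unicode` are hypothetical helper names, not necessarily the ones in `emojis.py`:

```python
import unicodedata

# Stand-in dictionaries with one entry each; the real ones live in emoji_values.py.
SRC_CODES_TO_CAP_NAME = {'\U000fe334': 'FACE WITH TEARS OF JOY'}
EMOJI_UNICODE = {'face_with_tears_of_joy': '\U0001F602'}


def cap_name_to_key(cap_name):
    """Hypothetical helper: 'FACE WITH TEARS OF JOY' -> 'face_with_tears_of_joy'."""
    return cap_name.lower().replace(' ', '_')


def src_to_unicode(src_code):
    """Return the standard emoji for a src code, or None if it is unmapped."""
    cap_name = SRC_CODES_TO_CAP_NAME.get(src_code)
    if cap_name is None:
        return None  # the failure case: src code missing from the dictionary
    return EMOJI_UNICODE.get(cap_name_to_key(cap_name))


known = src_to_unicode('\U000fe334')    # mapped src code
missing = src_to_unicode('\U000feb13')  # src code absent from the stand-in dictionary

# Both src codes above sit in a Unicode private-use plane (category 'Co'),
# so they carry no standard name that could be looked up directly.
category = unicodedata.category('\U000fe334')

print(known, missing, category)
```

Returning None instead of raising KeyError is a design choice for this sketch; it makes unmapped src codes easy to collect and report rather than stopping at the first one.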

The Solution

Since the Unicode dictionary is more complete than the src encodings dictionary, we can use it to look up the equivalent src encodings. The solution is essentially to go through every element in the EMOJI_UNICODE dictionary that doesn't have an equivalent in the src encodings dictionary and add an appropriate element. For example:

If we encounter a key and value from EMOJI_UNICODE without a corresponding src encoding key/value pair, say u':blue_heart:': u'\U0001F499', we can look up the Unicode value u'\U0001F499' and try to find a page that lists the src encoding for this emoji. This page contains the information we need. If you click on the link and scroll to the bottom of the encodings table, you can see that the src encoding is u'\U000FEB13' (to find it, try using Ctrl+F and searching for the code). To finish up this emoji, we add this information to the src codes dictionary; specifically, the line u'\U000feb13': u'BLUE HEART' would suffice. This maps the code u'\U000FEB13' to the string "BLUE HEART", which can then be mapped to our existing Unicode value.
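A minimal sketch of the blue heart example, using stand-in dictionaries rather than the real emoji_values.py. The `src_to_unicode` helper is hypothetical, and the colon-wrapped key format is assumed from the `:blue_heart:` example above:

```python
# Stand-in dictionaries; the real ones live in emoji_values.py.
SRC_CODES_TO_CAP_NAME = {'\U000fe334': 'FACE WITH TEARS OF JOY'}
EMOJI_UNICODE = {':blue_heart:': '\U0001F499'}

# The new entry for the src codes dictionary, as described above.
SRC_CODES_TO_CAP_NAME['\U000feb13'] = 'BLUE HEART'


def src_to_unicode(src_code):
    """Hypothetical helper: src code -> capitalized name -> Unicode emoji."""
    cap_name = SRC_CODES_TO_CAP_NAME[src_code]
    # Assumes keys of the form ':blue_heart:' (lowercase, underscores, colons).
    key = ':' + cap_name.lower().replace(' ', '_') + ':'
    return EMOJI_UNICODE[key]


# Hex digits in \U escapes are case-insensitive, so both spellings are the same key.
same_key = '\U000FEB13' == '\U000feb13'
blue_heart = src_to_unicode('\U000FEB13')

print(same_key, blue_heart)
```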

The above should be done for every element in the EMOJI_UNICODE dictionary that is not in SRC_CODES_TO_CAP_NAMES.
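To see which elements still need this treatment, something like the following could list them. It is a sketch only: the dictionaries are small stand-ins, and `key_to_cap_name` and `names_missing_src_code` are hypothetical helpers. The src code for each listed emoji still has to be looked up by hand as described above.

```python
# Stand-in dictionaries; the real ones live in emoji_values.py.
SRC_CODES_TO_CAP_NAME = {'\U000fe334': 'FACE WITH TEARS OF JOY'}
EMOJI_UNICODE = {
    ':face_with_tears_of_joy:': '\U0001F602',
    ':blue_heart:': '\U0001F499',
}


def key_to_cap_name(key):
    """Hypothetical helper: ':blue_heart:' or 'blue_heart' -> 'BLUE HEART'."""
    return key.strip(':').replace('_', ' ').upper()


def names_missing_src_code(emoji_unicode, src_codes_to_cap_name):
    """List (cap_name, unicode_value) pairs that have no src code entry yet."""
    covered = set(src_codes_to_cap_name.values())
    return sorted(
        (key_to_cap_name(key), value)
        for key, value in emoji_unicode.items()
        if key_to_cap_name(key) not in covered
    )


todo = names_missing_src_code(EMOJI_UNICODE, SRC_CODES_TO_CAP_NAME)
for cap_name, value in todo:
    # Show each code point in U+XXXX form, convenient for searching a reference page.
    code_points = ' '.join('U+%04X' % ord(ch) for ch in value)
    print(cap_name, code_points)
```

With the stand-in data this prints a single line for BLUE HEART, since FACE WITH TEARS OF JOY already has a src code entry.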