jamotools. Well, no dramatic changes: the word vocabulary is the same size (obviously) and the ngram vocabulary is 2x smaller, but it is still >1M. The plot of the ngram distribution looks much better, though.
r'\W+' also matches some Thai accent marks. Changed the preprocessing to keep only Thai and English characters and added proper tokenization from pythainlp. The final word vocab size is ~20k, with ~180k for the ngram vocabulary.
Normalized Russian data: the vocab size has decreased 4x, from ~90k to ~25k.
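A minimal sketch of those two steps, under the assumption that jamotools was used to split Hangul syllables into jamo (which is what the library does) and that pythainlp's `word_tokenize` handles the Thai tokenization; the function names here are illustrative:

```python
import re

import jamotools                                  # Hangul syllable <-> jamo utilities
from pythainlp.tokenize import word_tokenize      # Thai word segmentation

# Keep only Thai and basic Latin characters (same idea as the 'th' regex below).
THAI_CLEAN = re.compile(r'[^\u0E00-\u0E7Fa-zA-Z ]')

def preprocess_korean(text):
    """Decompose Hangul syllables into jamo so char ngrams become finer-grained."""
    return jamotools.split_syllables(text)

def preprocess_thai(text):
    """Strip non-Thai/non-Latin chars, then tokenize with pythainlp."""
    cleaned = THAI_CLEAN.sub(' ', text)
    return word_tokenize(cleaned, keep_whitespace=False)
```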
Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian. Removed Japanese and Tagalog.
Path to the balanced file: nvme/islanna/emoji_sentiment/data/fin_tweets.feather
Path to the file without balancing for languages above: nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather
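For reference, a loading sketch (assumes the files were written with pandas/pyarrow feather support):

```python
import pandas as pd

# Balanced dataset
balanced = pd.read_feather('nvme/islanna/emoji_sentiment/data/fin_tweets.feather')
# Unbalanced dataset for the languages listed above
full = pd.read_feather('nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather')
```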
Similar to the Russian merged distribution, but it can differ a little:
'π': 0.25, 'π': 0.23, 'π': 0.13, 'π': 0.08, 'π': 0.07, 'π': 0.07, 'π ': 0.05, 'π': 0.04, 'π': 0.03, 'π£': 0.03, 'π‘': 0.02
The smallest class, 'π‘', has only ~3.8k tweets in Indonesian; in Russian it has ~6k.
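Continuing the loading sketch above, the per-language class sizes and shares can be inspected roughly like this (the `lang` and `emoji` column names are assumptions about the schema):

```python
# Indonesian emoji class counts and shares (column names are assumptions).
id_counts = full.loc[full['lang'] == 'id', 'emoji'].value_counts()
print(id_counts.min())                         # smallest class (reported above as ~3.8k)
print((id_counts / id_counts.sum()).round(2))  # class shares, cf. the distribution above
```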
Vocabs contain only Latin characters and symbols from the particular language. Korean and Thai were processed separately from the rest.
Regular expressions for removing extra chars:
```python
lang_unuse = {'en': '[^a-zA-Z]',
              'ar': '[^\u0600-\u06FFa-zA-Z]',  # \u0621-\u064A maybe
              'es': '[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]',
              'th': '[^\u0E00-\u0E7Fa-zA-Z]',
              'ko': '[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318Fa-zA-Z]',
              'fr': '[^a-zA-ZÀ-ÿ]',
              'tr': '[^a-zA-ZğşöçĞŞÖÇıIiİuUüÜ]',
              'id': '[^a-zA-Z]',
              'it': '[^a-zA-Z]',
              'de': '[^a-zA-ZÀ-ÿ]',
              'ru': '[^a-zA-Zа-яА-ЯЁё]'}
```
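A hedged usage sketch for this table: compile the patterns once and substitute everything outside the allowed set with a space (`clean_text` is illustrative, not the actual pipeline code):

```python
import re

compiled = {lang: re.compile(pattern) for lang, pattern in lang_unuse.items()}

def clean_text(text, lang):
    """Replace characters outside the language's allowed set with spaces."""
    return compiled[lang].sub(' ', text).lower()

# e.g. clean_text('Привет, world!', 'ru') keeps only Cyrillic and Latin letters.
```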
Sizes in final dataset:
Lang | Size (tweets) | Word vocab | Ngram vocab |
---|---|---|---|
en | 299995 | 95640 | 522879 |
ar | 199993 | 253023 | 1127338 |
es | 299995 | 117597 | 498495 |
th | 349995 | 46542 | 331081 |
ko | 198561 | 515949 | 1859535 |
fr | 299995 | 99587 | 475570 |
tr | 199993 | 201967 | 671532 |
id | 199357 | 100246 | 457841 |
it | 210703 | 95578 | 397849 |
de | 184109 | 99169 | 515266 |
ru | 241117 | 172594 | 810772 |
The languages differ a lot. For example, the Thai dataset would have to be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic would have to be 2-4 times smaller. I suppose the only way to keep a real balance is to cut the ngram vocabulary for the difficult languages before model training.
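One way such a cut could look, sketched with collections.Counter (the cut size and the trigram choice are assumptions, not the actual settings):

```python
from collections import Counter

def cut_vocab(ngram_counts, max_size=500_000):
    """Keep only the most frequent ngrams of a 'difficult' language."""
    return {ngram for ngram, _ in ngram_counts.most_common(max_size)}

# ngram_counts would be built per language, e.g. a Counter over character
# trigrams of the cleaned tweets; anything outside the cut maps to <UNK>.
```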
Ngram vocab cut
Word vocab cut
@Islanna Some formatting ideas for pasting the data into an article, for easier storytelling. Also, describing in words what you did will help.
@Islanna Updated the file for the article.
Decided to balance the full dataset according to the emoji and ngram distributions in the Russian subset.
Languages
Full 2018 dataset size
Emoji merging
Emoji distribution
Distribution in the Russian 2018 dataset (~85k tweets)
Probably, we can further merge classes π and π , π and π
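A minimal sketch of such a merge; the emoji shown in the mapping are placeholders (the actual pairs are the ones named above), and the `emoji` column is an assumed name:

```python
import pandas as pd

# Placeholder mapping: each minority class is redirected to the class it is merged with.
merge_map = {'😀': '😊', '😭': '😢'}

def merge_emoji_classes(df, mapping=merge_map):
    """Collapse near-duplicate emoji labels before computing class distributions."""
    df = df.copy()
    df['emoji'] = df['emoji'].replace(mapping)
    return df
```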
Vocabulary distribution
Stratified sample: a random sample from the dataset with the same emoji distribution as in the Russian subset, max size 100k. Word and ngram vocabs are calculated on the stratified sample. Words: the processed text (no numbers or punctuation) split on spaces.
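A sketch of that sampling step, assuming `target_dist` is the Russian emoji class shares as a pandas Series and `emoji` is the label column (both names are illustrative):

```python
import pandas as pd

MAX_SIZE = 100_000

def stratified_sample(df_lang, target_dist, max_size=MAX_SIZE):
    """Random sample with the same emoji distribution as the Russian subset."""
    class_counts = df_lang['emoji'].value_counts().reindex(target_dist.index)
    # Largest sample size that respects the target shares without oversampling any class.
    limit = int(min(max_size, (class_counts / target_dist).min()))
    parts = [df_lang[df_lang['emoji'] == emoji].sample(int(limit * share), random_state=0)
             for emoji, share in target_dist.items()]
    return pd.concat(parts).reset_index(drop=True)
```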
Cover
Extract the top N% of chars/ngrams/words and check how much of the full dataset/stratified sample they cover (see the sketch below).
N%: [10%, ..., 90%]
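A sketch of that cover computation with collections.Counter; `tokens` can be the characters, character ngrams, or words of a sample, and the function name is illustrative:

```python
from collections import Counter

def coverage(tokens, top_fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """For each N%, the share of all occurrences covered by the top N% of the vocab."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    total = sum(freqs)
    return {frac: sum(freqs[:max(1, int(len(freqs) * frac))]) / total
            for frac in top_fractions}
```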
Chars
Most of the rare chars are characters from other languages: English letters in the Russian dataset, for example. Maybe these extra characters should be removed.
Japanese and Korean chars look much more like ngrams.
Ngrams
Only for the sample. Calculation for the full dataset is time-consuming.
Words