snakers4 / emoji_sentiment


Dataset balancing #2

Open Islanna opened 5 years ago

Islanna commented 5 years ago

Decided to balance the full dataset according to the emoji and ngram distributions in the Russian subset.

Languages

| Lang | Full 2018 dataset size |
|------|------------------------|
| en | 3918333 |
| ja | 2603697 |
| ar | 1416102 |
| es | 1237730 |
| pt | 869292 |
| th | 620532 |
| ko | 493476 |
| fr | 349677 |
| tr | 302217 |
| tl | 129997 |
| id | 109838 |
| it | 86488 |
| de | 85671 |
| ru | 84824 |

Emoji merging

exclude_emojis = ['πŸ™Œ','πŸ‘Š','🎢','πŸ’','βœ‹','🎧','πŸ”«','πŸ™…','πŸ‘€','πŸ’―']

merge_dict = {
    'πŸ’•':'😍',
    '❀':'😍',
    'πŸ’™':'😍',
    'β™₯':'😍',
    'πŸ’œ':'😍',
    'πŸ’–':'😍',
    'πŸ’Ÿ':'😍',
    '😘':'😍',
    'πŸ˜‰':'😏',
    '😒':'😭',
    '😁':'😊',
    'πŸ˜„':'😊',
    '😌':'😊',
    '☺':'😊',
    'πŸ‘Œ':'πŸ‘',
    'πŸ‘':'πŸ‘',
    'πŸ’ͺ':'πŸ‘',
    '✨':'πŸ‘',
    '✌':'πŸ‘',
    'πŸ˜‹':'😜',
    '😐':'πŸ˜‘',
    'πŸ˜’':'πŸ˜‘',
    'πŸ˜•':'πŸ˜‘',
    '😠':'😑',
    'πŸ’€':'😑',
    '😀':'😑',
    '😈':'😑',
    '😩':'πŸ˜”',
    '😞':'πŸ˜”',
    'πŸ˜ͺ':'πŸ˜”',
    '😷':'πŸ˜”',
    '😴':'πŸ˜”',
    'πŸ™ˆ':'πŸ˜…',
    'πŸ™Š':'πŸ˜…',
    '😳':'πŸ˜…',
    '😫':'😣',  
    'πŸ˜“':'😣',
    'πŸ˜–':'😣',
    '😬':'😣',
    'πŸ™':'😣'
}
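
For context, a minimal sketch of how this exclusion and merging could be applied to the labelled tweets (assuming a pandas DataFrame with an 'emoji' column; the helper is hypothetical, not the project's actual code):

import pandas as pd

def merge_emoji_labels(df: pd.DataFrame) -> pd.DataFrame:
    # Drop tweets labelled with excluded emojis, then map the rest onto their canonical class
    out = df[~df['emoji'].isin(exclude_emojis)].copy()
    out['emoji'] = out['emoji'].map(lambda e: merge_dict.get(e, e))
    return out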

Emoji distribution

Distribution in the Russian 2018 dataset (~85k tweets)

{'πŸ˜‚': 21529,
 '😍': 17369,
 '😊': 8777,
 'πŸ‘': 6195,
 '😏': 5559,
 '😭': 4556,
 'πŸ˜…': 4336,
 'πŸ˜‘': 2542,
 'πŸ’”': 2481,
 '😣': 2065,
 'πŸ˜”': 1924,
 '😑': 1884,
 '😎': 1782,
 '😜': 1454}

We could probably further merge the classes 😜 and 😏, and 😎 and 😊.

Vocabulary distribution

Stratified sample: a random sample from the dataset with the same emoji distribution as in the Russian subset, capped at 100k tweets. Word and ngram vocabularies are calculated on the stratified sample. Words are the processed text (no numbers or punctuation) split by spaces.
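
A minimal sketch of how such a stratified sample could be drawn with pandas (the dataframe, column name and function are assumptions, not the project's actual code):

import pandas as pd

def stratified_sample(df: pd.DataFrame, target_dist: dict, max_size: int = 100_000, seed: int = 42) -> pd.DataFrame:
    # target_dist: emoji -> fraction, e.g. the Russian counts above normalised to sum to 1
    total = min(max_size, len(df))
    parts = []
    for emoji, frac in target_dist.items():
        group = df[df['emoji'] == emoji]
        n = min(len(group), int(round(frac * total)))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle the result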

| Lang | Stratified sample size | Word vocab size | Ngram vocab size |
|------|------------------------|-----------------|------------------|
| en | 96544 | 50779 | 303809 |
| ja | 96036 | 224019 | 5380959 |
| ar | 96544 | 156335 | 751645 |
| es | 96544 | 63004 | 304999 |
| pt | 96544 | 45251 | 240593 |
| th | 94232 | 186816 | 935840 |
| ko | 93859 | 281594 | 2105074 |
| fr | 95571 | 53860 | 286240 |
| tr | 95135 | 122695 | 469954 |
| tl | 80123 | 52671 | 255014 |
| id | 76675 | 58604 | 296513 |
| it | 82705 | 57282 | 269276 |
| de | 79234 | 56615 | 332311 |
| ru | 84824 | 93778 | 492304 |

Cover

Extract the top N% most frequent chars/ngrams/words and check how much of the full dataset / stratified sample they cover.

N ∈ {10%, ..., 90%}
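
A minimal sketch of the coverage computation under these assumptions (token frequencies stored in a Counter; the function is hypothetical):

from collections import Counter

def coverage(token_counts: Counter, top_pct: float) -> float:
    # Share of all token occurrences covered by the top `top_pct` of the vocabulary
    ranked = [c for _, c in token_counts.most_common()]
    k = max(1, int(len(ranked) * top_pct))
    return sum(ranked[:k]) / sum(ranked)

# e.g. coverage(word_counts, 0.10) -> fraction of the corpus covered by the top 10% of words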

Chars

Most of the unpopular chars are characters from other languages (e.g. English letters in the Russian dataset). Maybe these extra characters should be removed.

(plots: full_chars, sample_chars)

Japanese and Korean chars look much more like ngrams.

Ngrams

Only for the sample. Calculation for the full dataset is time-consuming.

(plot: sample_ngrams)

Words

(plots: full_words, sample_words)
Islanna commented 5 years ago

Nonstandard languages

(plot: koupd)

Russian normalized dataset

After normalization of the Russian data, the word vocabulary size decreased roughly 4x, from ~90k to ~25k.
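
The comment doesn't say how the normalization was done; a minimal sketch, assuming it means lemmatization with pymorphy2:

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def normalize_ru(text: str) -> str:
    # Lemmatize each whitespace-separated token of an already-cleaned Russian tweet
    return ' '.join(morph.parse(tok)[0].normal_form for tok in text.split())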

Islanna commented 5 years ago

Final dataset distribution

Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian. Removed Japanese and Tagalog.

Path to the balanced file: nvme/islanna/emoji_sentiment/data/fin_tweets.feather

Path to the file without balancing for languages above: nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather

Emoji distribution

Similar to the merged Russian distribution above, though it can differ a little:

'πŸ˜‚': 0.25, '😍': 0.23, '😊': 0.13, '😏': 0.08, '😭': 0.07, 'πŸ‘': 0.07, 'πŸ˜…': 0.05, 'πŸ˜‘': 0.04, 'πŸ˜”': 0.03, '😣': 0.03, '😑': 0.02

The smallest class, '😑', has only 3.8k tweets in Indonesian; in Russian it has ~6k.

Vocabs

Preprocessing

Vocabs contain only Latin characters and the symbols of the particular language. Korean and Thai were processed separately from the rest.

Regular expressions for removing extra chars:

lang_unuse = {'en':'[^a-zA-Z]',
              'ar':'[^\u0600-\u06FFa-zA-Z]', # \u0621-\u064A maybe
              'es':'[^a-zA-ZΓ‘Γ©Γ­Γ³ΓΊΓΌΓ±ΓΓ‰ΓΓ“ΓšΓœΓ‘]',
              'th':'[^\u0E00-\u0E7Fa-zA-Z]',
              'ko':'[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318Fa-zA-Z]',
              'fr':'[^a-zA-ZΓ€-ΓΏ]',
              'tr':'[^a-zA-ZΔŸΕŸΓΆΓ§ΔžΕžΓ–Γ‡Δ±IiΔ°uUüÜ]',
              'id':'[^a-zA-Z]',
              'it':'[^a-zA-Z]',
              'de':'[^a-zA-ZΓ€-ΓΏ]',
              'ru':'[^a-zA-ZΠ°-яА-ЯЁё]'}
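
A minimal sketch of how these patterns could be applied (the helper and the lowercasing step are assumptions):

import re

def clean_text(text: str, lang: str) -> str:
    # Replace characters outside the language's allowed set with spaces, collapse whitespace, lowercase
    without_extra = re.sub(lang_unuse[lang], ' ', text)
    return re.sub(r'\s+', ' ', without_extra).strip().lower()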

Balancing

Sizes in the final dataset:

| Lang | Size | Word vocab size | Ngram vocab size |
|------|------|-----------------|------------------|
| en | 299995 | 95640 | 522879 |
| ar | 199993 | 253023 | 1127338 |
| es | 299995 | 117597 | 498495 |
| th | 349995 | 46542 | 331081 |
| ko | 198561 | 515949 | 1859535 |
| fr | 299995 | 99587 | 475570 |
| tr | 199993 | 201967 | 671532 |
| id | 199357 | 100246 | 457841 |
| it | 210703 | 95578 | 397849 |
| de | 184109 | 99169 | 515266 |
| ru | 241117 | 172594 | 810772 |

All languages behave differently. For example, the Thai dataset would have to be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic would have to be 2-4 times smaller. I suppose the only way to keep a real balance is to cut the ngram vocabulary for the difficult languages before model training.
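
A minimal sketch of such a cut (hypothetical helper; assumes the vocabularies are frequency Counters and everything outside the kept set maps to a single <unk> token):

from collections import Counter

def cut_vocab(counts: Counter, max_size: int) -> dict:
    # Keep the `max_size` most frequent tokens; everything else maps to <unk> (index 0)
    itos = ['<unk>'] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: i for i, tok in enumerate(itos)}

# vocab = cut_vocab(ngram_counts, 500_000)
# ids = [vocab.get(ng, 0) for ng in tweet_ngrams]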

Ngram vocab cut

(plot: ngram vocab)

Word vocab cut

(plot: word vocab)
snakers4 commented 5 years ago

@Islanna Some formatting ideas for pasting the data into an article, for easier storytelling. Describing what you did with the words would also help.

stats.xlsx

snakers4 commented 5 years ago

stats.xlsx

@Islanna Updated file for the article.