jamotools. Well, no dramatic changes: the word vocabulary is the same size (obviously) and the ngram vocabulary is 2x smaller, but it is still >1M. The plot of the ngram distribution looks much better, though.
r'\W+' also matches some Thai accent marks. Changed the preprocessing to keep only Thai and English characters and added proper tokenization from pythainlp. The final word vocab size is ~20k, with ~180k for the ngram vocabulary.
Normalized Russian data: the vocab size has decreased 4x, from ~90k to ~25k.
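A minimal sketch of those two steps, under the assumption that jamotools was used to split Hangul syllables into jamo (which is what the library does) and that pythainlp's `word_tokenize` handles the Thai tokenization; the function names here are illustrative:

```python
import re

import jamotools                                  # Hangul syllable <-> jamo utilities
from pythainlp.tokenize import word_tokenize      # Thai word segmentation

# Keep only Thai and basic Latin characters (same idea as the 'th' regex below).
THAI_CLEAN = re.compile(r'[^\u0E00-\u0E7Fa-zA-Z ]')

def preprocess_korean(text):
    """Decompose Hangul syllables into jamo so char ngrams become finer-grained."""
    return jamotools.split_syllables(text)

def preprocess_thai(text):
    """Strip non-Thai/non-Latin chars, then tokenize with pythainlp."""
    cleaned = THAI_CLEAN.sub(' ', text)
    return word_tokenize(cleaned, keep_whitespace=False)
```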
Languages: English, Arabic, Spanish, Thai, Korean, French, Turkish, Indonesian, Italian, German, Russian. Removed Japanese and Tagalog.
Path to the balanced file: nvme/islanna/emoji_sentiment/data/fin_tweets.feather
Path to the file without balancing for languages above: nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather
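For reference, a loading sketch (assumes the files were written with pandas/pyarrow feather support):

```python
import pandas as pd

# Balanced dataset
balanced = pd.read_feather('nvme/islanna/emoji_sentiment/data/fin_tweets.feather')
# Unbalanced dataset for the languages listed above
full = pd.read_feather('nvme/islanna/emoji_sentiment/data/twitter_proc_full.feather')
```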
Similar to the Russian merged distribution, but it can differ a little:
'π': 0.25, 'π': 0.23, 'π': 0.13, 'π': 0.08, 'π': 0.07, 'π': 0.07, 'π ': 0.05, 'π': 0.04, 'π': 0.03, 'π£': 0.03, 'π‘': 0.02
The smallest class, 'π‘', has only ~3.8k tweets in Indonesian; in Russian it has ~6k.
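Continuing the loading sketch above, the per-language class sizes and shares can be inspected roughly like this (the `lang` and `emoji` column names are assumptions about the schema):

```python
# Indonesian emoji class counts and shares (column names are assumptions).
id_counts = full.loc[full['lang'] == 'id', 'emoji'].value_counts()
print(id_counts.min())                         # smallest class (reported above as ~3.8k)
print((id_counts / id_counts.sum()).round(2))  # class shares, cf. the distribution above
```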
Vocabs contain only Latin characters and symbols from the particular language. Korean and Thai were processed separately from the rest.
Regular expressions for removing extra chars:
```python
lang_unuse = {'en': '[^a-zA-Z]',
              'ar': '[^\u0600-\u06FFa-zA-Z]',  # \u0621-\u064A maybe
              'es': '[^a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]',
              'th': '[^\u0E00-\u0E7Fa-zA-Z]',
              'ko': '[^\uAC00-\uD7A3\u1100-\u11FF\u3130-\u318Fa-zA-Z]',
              'fr': '[^a-zA-ZÀ-ÿ]',
              'tr': '[^a-zA-ZğşöçĞŞÖÇıIiİuUüÜ]',
              'id': '[^a-zA-Z]',
              'it': '[^a-zA-Z]',
              'de': '[^a-zA-ZÀ-ÿ]',
              'ru': '[^a-zA-Zа-яА-ЯЁё]'}
```
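A hedged usage sketch for this table: compile the patterns once and substitute everything outside the allowed set with a space (`clean_text` is illustrative, not the actual pipeline code):

```python
import re

compiled = {lang: re.compile(pattern) for lang, pattern in lang_unuse.items()}

def clean_text(text, lang):
    """Replace characters outside the language's allowed set with spaces."""
    return compiled[lang].sub(' ', text).lower()

# e.g. clean_text('Привет, world!', 'ru') keeps only Cyrillic and Latin letters.
```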
Sizes in final dataset:
Lang | Size (tweets) | Word vocab | Ngram vocab |
---|---|---|---|
en | 299995 | 95640 | 522879 |
ar | 199993 | 253023 | 1127338 |
es | 299995 | 117597 | 498495 |
th | 349995 | 46542 | 331081 |
ko | 198561 | 515949 | 1859535 |
fr | 299995 | 99587 | 475570 |
tr | 199993 | 201967 | 671532 |
id | 199357 | 100246 | 457841 |
it | 210703 | 95578 | 397849 |
de | 184109 | 99169 | 515266 |
ru | 241117 | 172594 | 810772 |
The languages differ a lot. For example, the Thai dataset would have to be 10 times larger than the Russian one to reach the same ngram vocabulary size, while Korean and Arabic would have to be 2-4 times smaller. I suppose the only way to keep a real balance is to cut the ngram vocabulary for the difficult languages before model training.
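One way such a cut could look, sketched with collections.Counter (the cut size and the trigram choice are assumptions, not the actual settings):

```python
from collections import Counter

def cut_vocab(ngram_counts, max_size=500_000):
    """Keep only the most frequent ngrams of a 'difficult' language."""
    return {ngram for ngram, _ in ngram_counts.most_common(max_size)}

# ngram_counts would be built per language, e.g. a Counter over character
# trigrams of the cleaned tweets; anything outside the cut maps to <UNK>.
```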
Ngram vocab cut
Word vocab cut
@Islanna Some formatting ideas for pasting the data into an article, for easier storytelling. Also, describing in words what you did will help.
@Islanna Updated the file for the article.
Decided to balance the full dataset according to the emoji and ngram distributions in the Russian subset.
Languages
Full 2018 dataset size
Emoji merging
Emoji distribution
Distribution in the Russian 2018 dataset (~85k tweets)
Probably, we can further merge classes π and π , π and π
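A minimal sketch of such a merge; the emoji shown in the mapping are placeholders (the actual pairs are the ones named above), and the `emoji` column is an assumed name:

```python
import pandas as pd

# Placeholder mapping: each minority class is redirected to the class it is merged with.
merge_map = {'😀': '😊', '😭': '😢'}

def merge_emoji_classes(df, mapping=merge_map):
    """Collapse near-duplicate emoji labels before computing class distributions."""
    df = df.copy()
    df['emoji'] = df['emoji'].replace(mapping)
    return df
```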
Vocabulary distribution
Stratified sample: a random sample from the dataset with the same emoji distribution as in the Russian subset, max size 100k. Word and ngram vocabs are calculated on the stratified sample. Words: the processed text (no numbers or punctuation) split on spaces.
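A sketch of that sampling step, assuming `target_dist` is the Russian emoji class shares as a pandas Series and `emoji` is the label column (both names are illustrative):

```python
import pandas as pd

MAX_SIZE = 100_000

def stratified_sample(df_lang, target_dist, max_size=MAX_SIZE):
    """Random sample with the same emoji distribution as the Russian subset."""
    class_counts = df_lang['emoji'].value_counts().reindex(target_dist.index)
    # Largest sample size that respects the target shares without oversampling any class.
    limit = int(min(max_size, (class_counts / target_dist).min()))
    parts = [df_lang[df_lang['emoji'] == emoji].sample(int(limit * share), random_state=0)
             for emoji, share in target_dist.items()]
    return pd.concat(parts).reset_index(drop=True)
```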
Cover
Extract the top N% of chars/ngrams/words and check how much of the full dataset/stratified sample they cover (see the sketch below).
N%: [10%, ..., 90%]
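A sketch of that cover computation with collections.Counter; `tokens` can be the characters, character ngrams, or words of a sample, and the function name is illustrative:

```python
from collections import Counter

def coverage(tokens, top_fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """For each N%, the share of all occurrences covered by the top N% of the vocab."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    total = sum(freqs)
    return {frac: sum(freqs[:max(1, int(len(freqs) * frac))]) / total
            for frac in top_fractions}
```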
Chars
Most of the rare chars are characters from other languages: English letters in the Russian dataset, for example. Maybe these extra characters should be removed.
Japanese and Korean chars look much more like ngrams.
Ngrams
Only for the sample. Calculation for the full dataset is time-consuming.
Words