mesolitica / malaysian-dataset

We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/
Apache License 2.0
297 stars, 106 forks

bar plot languages count LLM dataset #41

Open huseinzol05 opened 1 year ago

huseinzol05 commented 1 year ago

The fastText model was trained on the following labels:

lang_labels_v2 = {
    0: 'standard-english',
    1: 'local-english',
    2: 'manglish',
    3: 'standard-indonesian',
    4: 'socialmedia-indonesian',
    5: 'standard-malay',
    6: 'local-malay',
    7: 'standard-mandarin',
    8: 'local-mandarin',
    9: 'other',
}

Steps to reproduce the fastText training are at https://github.com/huseinzol05/malaya/blob/5.1/pretrained-model/language-detection-v2/train-fasttext-auto.ipynb
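
For anyone who wants to produce the per-language counts behind the bar plot, here is a minimal sketch, not the code used for this issue. It assumes the classifier output has already been converted to integer class ids matching `lang_labels_v2`. The names `count_languages`, `text_bar_plot`, and `sample_ids` are hypothetical, and the sample predictions are made up for illustration. It uses only the standard library and prints a text bar chart, so swap in your plotting library of choice.

```python
from collections import Counter

# Label map copied from the comment above.
lang_labels_v2 = {
    0: 'standard-english',
    1: 'local-english',
    2: 'manglish',
    3: 'standard-indonesian',
    4: 'socialmedia-indonesian',
    5: 'standard-malay',
    6: 'local-malay',
    7: 'standard-mandarin',
    8: 'local-mandarin',
    9: 'other',
}


def count_languages(predicted_ids):
    """Map predicted class ids to label names and count each label."""
    counts = Counter(lang_labels_v2[i] for i in predicted_ids)
    # Keep every label in id order, including labels with zero hits.
    return {name: counts.get(name, 0) for name in lang_labels_v2.values()}


def text_bar_plot(counts, width=40):
    """Render the counts as a plain-text horizontal bar chart."""
    peak = max(counts.values()) or 1  # avoid dividing by zero
    lines = []
    for name, n in counts.items():
        bar = '#' * round(width * n / peak)
        lines.append(f'{name:<24}{bar} {n}')
    return '\n'.join(lines)


# Hypothetical predictions, one class id per document (made-up data).
sample_ids = [5, 5, 6, 2, 0, 5, 9, 6]
counts = count_languages(sample_ids)
print(text_bar_plot(counts))
```

Labels with no predictions are kept with a count of zero so the chart always shows all ten classes in the same order.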

huseinzol05 commented 1 year ago

[Image: bar plot of language counts]