Open nathanfletcher opened 2 years ago
@nathanfletcher @lacabra Hello, I would love to work on this task. Can I be assigned to it? Warm regards.
@nathanbaleeta @lacabra for a start, I am looking into a mechanism to add support for French, German and Spanish.
Hi @Simpleshell3, apologies for the slight delay in getting back to you on this one. Can you please elaborate on what you have in mind? There are multiple and complementary ways to approach it, from automatic language detection to an optional language flag that directs to separate ML models.
Disclaimer: we have some potential collaborators outside of Outreachy who may be working on this, which is why we are not assigning this right away (also, this is potentially a very big scope, much bigger than what we would expect you to tackle at this stage). Having said this, we're curious to learn more about how you would approach it. Thank you 🙏
@lacabra Thank you. The model that powers Kindly for offensive tweet classification is Twitter-roBERTa-base for Offensive Language Identification. This model doesn't support multiple languages: it is a roBERTa-base model trained on about 58M tweets and fine-tuned for offensive language identification in English only. I am proposing XLM-T, a multilingual language model toolkit for Twitter, which is a pre-trained multilingual language model trained on 200M tweets across 30+ languages. This model can be fine-tuned for offensive tweet classification, which I believe would address issue #15. The base model is plausible, as it has already been fine-tuned here for multilingual sentiment analysis, as shown in the attachment.
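To make the trade-off concrete, here is a minimal sketch of the two deployment shapes being compared: one model per language versus a single multilingual model with fallback. The model identifiers and the `select_model` helper are hypothetical placeholders for illustration, not Kindly's actual code or deployed model names.

```python
from typing import Optional

# Hypothetical per-language registry: each entry requires its own
# training data, fine-tuning run, and deployment.
PER_LANGUAGE_MODELS = {
    "en": "twitter-roberta-base-offensive",  # English-only baseline (current setup)
}

# Hypothetical single multilingual model (e.g. an XLM-T fine-tune)
# that covers languages with no dedicated model.
MULTILINGUAL_MODEL = "twitter-xlm-roberta-base-offensive"

def select_model(lang: Optional[str]) -> str:
    """Pick a model for a request, given an optional language flag.

    With per-language models, any language missing from the registry has
    no coverage at all; a single multilingual model covers it by default.
    """
    if lang in PER_LANGUAGE_MODELS:
        return PER_LANGUAGE_MODELS[lang]
    return MULTILINGUAL_MODEL

print(select_model("en"))  # dedicated English model
print(select_model("sw"))  # Swahili falls back to the multilingual model
print(select_model(None))  # no flag: multilingual model handles detection
```

The point of the sketch is that the multilingual branch is a single deployment, whereas the registry grows (with data collection and fine-tuning cost) for every language added.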
Hi @Simpleshell3, thanks for looking into this and providing these amazing stats with your proposal.
Upon further examination, a number of technical deployment concerns come up with this strategy.
+1 for XLM models. Obtaining enough high-quality training data outside of English, French, and a handful of other popular languages would be extremely difficult. XLM models can effectively transfer what they learn from a large English-language corpus to low-resource languages where comparably large, high-quality corpora just don't exist.
https://arxiv.org/pdf/1901.07291 https://arxiv.org/abs/1911.02116
XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
It would take a tremendous amount of effort to find/create training datasets, train, fine-tune, evaluate, and deploy enough single-language models to match XLM-R's capabilities.
Avoiding that effort by building on XLM models is the practical path to making Kindly support multiple languages in the future.
I'll also put this issue under #2 to make it a sub-issue.