Open nathanfletcher opened 2 years ago
@nathanfletcher @lacabra Hello, I would love to work on this task. Can I be assigned to it? Warm regards.
@nathanbaleeta @lacabra for a start, I am looking into a mechanism to add support for French, German and Spanish.
Hi @Simpleshell3, apologies for the slight delay in getting back to you on this one. Can you please elaborate on what you have in mind? There are multiple and complementary ways to approach it, from automatic language detection to an optional language flag that directs to separate ML models.
Disclaimer: we have some potential collaborators outside of Outreachy who may be working on this, which is why we are not assigning this right away (also, this is potentially a very big scope, much bigger than what we would expect you to tackle at this stage). Having said this, we're curious to learn more about how you would approach it. Thank you 🙏
@lacabra Thank you. The model that powers Kindly for offensive tweet classification is Twitter-roBERTa-base for Offensive Language Identification. This model doesn't support multiple languages: it is a roBERTa-base model trained on about 58M tweets and fine-tuned for offensive language identification in English only. I am proposing XLM-T, a multilingual language model toolkit for Twitter, which is a pre-trained multilingual language model trained on 200M tweets across 30+ languages. This model can be fine-tuned for offensive tweet classification, which I believe would address issue #15. The base model is plausible, as it has already been fine-tuned here for multilingual sentiment analysis, as shown in the attachment.
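To make the trade-off concrete, here is a minimal sketch of the two deployment shapes being compared: one model per language versus a single multilingual model with fallback. The model identifiers and the `select_model` helper are hypothetical placeholders for illustration, not Kindly's actual code or deployed model names.

```python
from typing import Optional

# Hypothetical per-language registry: each entry requires its own
# training data, fine-tuning run, and deployment.
PER_LANGUAGE_MODELS = {
    "en": "twitter-roberta-base-offensive",  # English-only baseline (current setup)
}

# Hypothetical single multilingual model (e.g. an XLM-T fine-tune)
# that covers languages with no dedicated model.
MULTILINGUAL_MODEL = "twitter-xlm-roberta-base-offensive"

def select_model(lang: Optional[str]) -> str:
    """Pick a model for a request, given an optional language flag.

    With per-language models, any language missing from the registry has
    no coverage at all; a single multilingual model covers it by default.
    """
    if lang in PER_LANGUAGE_MODELS:
        return PER_LANGUAGE_MODELS[lang]
    return MULTILINGUAL_MODEL

print(select_model("en"))  # dedicated English model
print(select_model("sw"))  # Swahili falls back to the multilingual model
print(select_model(None))  # no flag: multilingual model handles detection
```

The point of the sketch is that the multilingual branch is a single deployment, whereas the registry grows (with data collection and fine-tuning cost) for every language added.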
Hi @Simpleshell3, thanks for looking into this and providing these amazing stats with your proposal.
Upon further examination, a number of technical deployment concerns come up with this strategy.
+1 for XLM models. Obtaining enough high-quality training data outside of English, French, and a handful of other popular languages would be extremely difficult. XLM models can effectively transfer what they learn from a large English-language corpus to low-resource languages where comparably large, high-quality corpora just don't exist.
https://arxiv.org/pdf/1901.07291 https://arxiv.org/abs/1911.02116
XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
It would take a tremendous amount of effort to find/create training datasets, train, fine-tune, evaluate, and deploy enough single-language models to match XLM-R's capabilities.
Avoiding that effort by building on XLM models is the practical path to making Kindly support multiple languages in the future.
I'll also put this issue under #2 to make it a sub-issue.