oscar-project / corpus

corpus issues.
Apache License 2.0

Low size of Swahili Oscar #16

Open hadyelsahar opened 2 years ago

hadyelsahar commented 2 years ago

I wonder if there's a reason behind the small size of the Swahili corpus: 7 MB in the latest release and 13 MB overall.

PS: There's a Swahili Wikipedia with 68K articles. If you need help extracting text from the dump, let me know and I can forward you some scripts. https://sw.wikipedia.org/