Open lintangsutawika opened 9 months ago
MOT corpus is very small (35M tokens of English, less than 1B tokens total) but worth mentioning for completeness
ProPublica, DemocracyNow, and The Conversation use CC BY-NC-ND 3.0
Agência Pública and Mongabay use CC BY-ND
Tasnim uses CC BY 4.0
OpenDemocracy only says it uses CC
@StellaAthena
@lintangsutawika OpenDemocracy is CC-BY-NC if you go look at the individual articles.
So it seems that Tasnim is the only source that has a sufficiently permissive license.
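The license check above can be sketched as a small filter. This is only an illustration: the license strings are the ones reported in this thread (not independently verified), and treating any license with an NC or ND clause as not permissive enough is an assumption about the criterion being applied here. The names `SOURCE_LICENSES`, `RESTRICTIVE_CLAUSES`, and `is_permissive` are hypothetical.

```python
# Licenses as reported in this thread (not independently verified).
SOURCE_LICENSES = {
    "ProPublica": "CC BY-NC-ND 3.0",
    "DemocracyNow": "CC BY-NC-ND 3.0",
    "The Conversation": "CC BY-NC-ND 3.0",
    "Agência Pública": "CC BY-ND",
    "Mongabay": "CC BY-ND",
    "Tasnim": "CC BY 4.0",
    "OpenDemocracy": "CC BY-NC",
}

# Assumption: NonCommercial (NC) and NoDerivatives (ND) clauses
# make a license too restrictive for this corpus.
RESTRICTIVE_CLAUSES = {"NC", "ND"}


def is_permissive(license_name: str) -> bool:
    """Return True if the license string has no NC or ND clause."""
    # "CC BY-NC-ND 3.0" -> ["CC", "BY", "NC", "ND", "3.0"]
    tokens = license_name.upper().replace("-", " ").split()
    return not RESTRICTIVE_CLAUSES.intersection(tokens)


permissive_sources = sorted(
    name for name, lic in SOURCE_LICENSES.items() if is_permissive(lic)
)
print(permissive_sources)
```

Under these assumptions the filter keeps only Tasnim, which matches the conclusion above. Licenses such as CC BY-SA or Public Domain would also pass the check.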
That's too bad. I was hoping to balance Tasnim with a news source biased in the opposite direction. Without news sites, it would be hard to train a model on current or recent events.
Found a site that indexes news and also tags each source with the specific CC license it uses: https://opennewswire.org/feed/
Similar to #17
ProPublica
DemocracyNow
The Conversation
Agência Pública
OpenDemocracy
Mongabay
CC BY
CC BY-SA
Public Domain